Core Concepts

DocumentConverter

The DocumentConverter class is the main entry point for converting documents. It detects the input format, runs the matching processing pipeline, and returns the conversion result.

Python
from docling.document_converter import DocumentConverter

source = "document.pdf"  # local file path or URL

converter = DocumentConverter()
result = converter.convert(source)

Initialization Parameters

Parameter        Type                             Description
allowed_formats  list[InputFormat]                Input formats the converter accepts (PDF, DOCX, etc.); all supported formats by default
format_options   dict[InputFormat, FormatOption]  Per-format settings, including the pipeline class (standard, VLM, etc.) and its options, such as the VLM model

Conversion Result

The convert() method returns a ConversionResult containing the converted document, the conversion status, and any errors recorded along the way.

Python
result = converter.convert("document.pdf")
doc = result.document  # Access the converted DoclingDocument
status = result.status  # Access the conversion status (e.g., SUCCESS)

DoclingDocument

The DoclingDocument is the unified representation of all converted documents, regardless of source format.

Document Properties

  • pages - Mapping of page number to page information (size, optional page image)
  • tables - Extracted tables
  • pictures - Document images and figures
  • body - Root of the document structure tree
  • origin - Source file metadata (filename, MIME type)

Export Methods

export_to_markdown()

Export the document to Markdown format:

Python
from docling_core.types.doc import ImageRefMode

markdown = doc.export_to_markdown()
# Optional parameters:
markdown = doc.export_to_markdown(
    image_mode=ImageRefMode.PLACEHOLDER,  # or EMBEDDED, REFERENCED
    image_placeholder="<!-- image -->"
)

Parameters

  • image_mode - How images are represented (placeholder, embedded data, or file reference)
  • image_placeholder - Text inserted in place of each image in placeholder mode

export_to_html()

Export the document to HTML format:

Python
from docling_core.types.doc import ImageRefMode

html = doc.export_to_html()
# Optional parameters:
html = doc.export_to_html(
    image_mode=ImageRefMode.EMBEDDED  # embed image data in the HTML
)

export_to_dict()

Export the document to a Python dictionary (lossless JSON representation):

Python
import json
doc_dict = doc.export_to_dict()
json_str = json.dumps(doc_dict, indent=2)
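
Because the exported dictionary is plain JSON-compatible data, it can be written to disk and read back later without re-running the conversion. The sketch below shows that round trip with the standard library only. The `doc_dict` value is a small stand-in for the output of doc.export_to_dict(), and the helper names `save_doc_dict` and `load_doc_dict` are hypothetical, not part of the library.

```python
import json
import tempfile
from pathlib import Path


def save_doc_dict(doc_dict: dict, path: Path) -> None:
    """Write an exported document dictionary to a UTF-8 JSON file."""
    # ensure_ascii=False keeps non-ASCII document text readable in the file
    text = json.dumps(doc_dict, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")


def load_doc_dict(path: Path) -> dict:
    """Read a previously saved document dictionary back from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


# Stand-in for doc.export_to_dict(); a real export has many more keys.
doc_dict = {"name": "document", "texts": [{"text": "Héllo"}]}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "document.json"
    save_doc_dict(doc_dict, out_path)
    restored = load_doc_dict(out_path)
```

Writing with an explicit encoding avoids platform-dependent defaults when the document contains non-ASCII text.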

export_to_doctags()

Export the document to DocTags format for AI systems:

Python
doctags = doc.export_to_doctags()

Configuration

Basic Configuration

Configure the converter with various options:

Python
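from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True  # run table structure recognition

converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    },
)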

VLM Pipeline

Use Visual Language Models for enhanced document understanding:

Python
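from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDFs through the VLM pipeline instead of the standard one.
# The model (e.g., granite_docling) is selected via VlmPipelineOptions.vlm_options.
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)

result = converter.convert("document.pdf")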

On Apple Silicon, MLX acceleration is automatically used when available.

OCR Configuration

Configure OCR settings for scanned documents:

Python
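from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_ocr=True)
pipeline_options.ocr_options = TesseractCliOcrOptions(lang=["eng"])  # Language codes

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)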

Working with Document Components

Accessing Pages

Python
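doc = result.document

# Iterate through pages (doc.pages maps page numbers to page items)
for page_no in doc.pages:
    n_items = sum(1 for _ in doc.iterate_items(page_no=page_no))
    print(f"Page {page_no}: {n_items} elements")
    print(doc.export_to_markdown(page_no=page_no))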

Working with Tables

Python
# Access all tables
for table in doc.tables:
    print(f"Table with {table.data.num_rows} rows")
    print(table.export_to_markdown(doc=doc))

    # Access table data (grid is a list of rows, each a list of cells)
    for row in table.data.grid:
        for cell in row:
            print(cell.text)
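
Once the cell text of a table has been collected into a list of rows, it is often handier to store it as CSV than to print it cell by cell. The following is a minimal sketch using only the standard library. The `rows` value is a stand-in for text gathered from one table, and `rows_to_csv` is a hypothetical helper name.

```python
import csv
import io


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize a list of rows (each a list of cell strings) as CSV text."""
    buffer = io.StringIO()
    # The csv module quotes cells that contain commas, quotes, or newlines
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for cell text gathered from one table
rows = [["Name", "Qty"], ["Widget, large", "3"]]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Using the csv module rather than joining cells with commas keeps cells that contain commas or quotes intact.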

Working with Images

Python
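# Access images
for picture in doc.pictures:
    print(f"Image: {picture.self_ref}")
    if picture.image is not None:  # set only when picture images were generated
        print(f"Type: {picture.image.mimetype}")
        print(f"Size: {picture.image.size.width}x{picture.image.size.height}")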

Error Handling

Handle errors gracefully in your applications:

Python
from docling.document_converter import DocumentConverter
from docling.exceptions import ConversionError

try:
    converter = DocumentConverter()
    result = converter.convert("document.pdf")
except ConversionError as e:
    print(f"Error converting document: {e}")
except FileNotFoundError:
    print("Document file not found")
except Exception as e:
    print(f"Unexpected error: {e}")
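
When many files are converted in a loop, it can be more convenient to record each failure and continue than to print and stop. The sketch below wraps an arbitrary conversion callable and returns a small outcome record instead of raising. `Outcome`, `convert_safely`, and `fake_convert` are hypothetical names introduced for illustration; in real use the callable would be something like converter.convert.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Outcome:
    """Result of one conversion attempt (hypothetical helper type)."""
    path: str
    result: Optional[Any] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def convert_safely(convert: Callable[[str], Any], path: str) -> Outcome:
    """Call convert(path) and capture any failure instead of raising."""
    try:
        return Outcome(path=path, result=convert(path))
    except FileNotFoundError:
        return Outcome(path=path, error="file not found")
    except Exception as exc:  # library-specific errors land here too
        return Outcome(path=path, error=f"{type(exc).__name__}: {exc}")


# Stand-in for converter.convert so the sketch runs without any files
def fake_convert(path: str) -> str:
    if path == "missing.pdf":
        raise FileNotFoundError(path)
    return f"converted {path}"


outcomes = [convert_safely(fake_convert, p) for p in ["a.pdf", "missing.pdf"]]
failed = [o.path for o in outcomes if not o.ok]
```

Collecting outcomes this way lets a caller report all failed paths at the end of a run.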

Performance Optimization

Batch Processing

Process multiple documents efficiently:

Python
from concurrent.futures import ThreadPoolExecutor
from docling.document_converter import DocumentConverter

def convert_document(path):
    converter = DocumentConverter()
    return converter.convert(path)

documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(convert_document, documents))
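
Constructing a converter for every document repeats any setup work it performs. One possible refinement is to build one converter per worker thread and reuse it for every document that thread handles. The sketch below shows this pattern with the standard library. `FakeConverter` is a stand-in so the example runs without the library installed, `make_thread_local_getter` is a hypothetical helper, and using one instance per thread is a conservative assumption, since nothing here states whether a single converter can safely be shared across threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def make_thread_local_getter(factory: Callable[[], Any]) -> Callable[[], Any]:
    """Return a function that builds one object per thread and then reuses it."""
    local = threading.local()

    def get() -> Any:
        if not hasattr(local, "instance"):
            local.instance = factory()
        return local.instance

    return get


class FakeConverter:
    """Stand-in for DocumentConverter that counts how often it is constructed."""
    created = 0
    _lock = threading.Lock()

    def __init__(self) -> None:
        with FakeConverter._lock:
            FakeConverter.created += 1

    def convert(self, path: str) -> str:
        return f"converted {path}"


get_converter = make_thread_local_getter(FakeConverter)


def convert_document(path: str) -> str:
    # Reuses the converter that belongs to the current worker thread
    return get_converter().convert(path)


documents = [f"doc{i}.pdf" for i in range(20)]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(convert_document, documents))
```

With four workers, at most four converters are created regardless of how many documents are processed, and executor.map returns results in input order.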

Caching

Models are cached automatically after the first download. To clear the cache, delete the cache directory:

Python
# Cache location is typically in:
# ~/.cache/docling/models/
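
Clearing the cache amounts to deleting that directory. A minimal sketch follows, assuming the typical location given above; `clear_model_cache` is a hypothetical helper, and the demonstration runs against a throwaway directory rather than the real cache.

```python
import shutil
import tempfile
from pathlib import Path

# Default taken from the typical location mentioned above; adjust if yours differs.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "docling" / "models"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Delete the cache directory. Returns True if something was removed."""
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstrate on a throwaway directory rather than the real cache
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "models"
    demo_dir.mkdir()
    (demo_dir / "weights.bin").write_bytes(b"\x00")
    removed = clear_model_cache(demo_dir)
    removed_again = clear_model_cache(demo_dir)
    still_exists = demo_dir.exists()
```

After the cache is cleared, models are downloaded again the next time they are needed.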

Advanced Usage

Custom Processing Pipeline

Create custom processing pipelines for specific use cases:

Python
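from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Custom configuration: choose the pipeline class per input format
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_cls=StandardPdfPipeline),
        # Add custom pipeline configuration
    }
)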

Streaming Large Documents

For very large documents, handle the converted output one page at a time rather than as a single string:

Python
converter = DocumentConverter()
result = converter.convert("large_document.pdf")
doc = result.document

# Process pages incrementally; process_page is your own handler
for page_no in doc.pages:
    process_page(doc.export_to_markdown(page_no=page_no))
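
The page handler is left to the application. One simple option is to write each page's Markdown to its own file, so that no more than one page of output is held in memory at a time. The sketch below assumes the pages arrive as (page number, Markdown) pairs; `write_pages` and `fake_pages` are hypothetical names, and the generator stands in for real page exports.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def write_pages(pages: Iterable[Tuple[int, str]], out_dir: Path) -> list[Path]:
    """Write (page_number, markdown) pairs to out_dir, one file per page."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page_number, markdown in pages:
        # Zero-padded names keep the files in page order when sorted
        target = out_dir / f"page-{page_number:04d}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written


def fake_pages() -> Iterator[Tuple[int, str]]:
    """Stand-in generator; in practice each item would come from a page export."""
    for number in (1, 2, 3):
        yield number, f"# Page {number}\n"


with tempfile.TemporaryDirectory() as tmp:
    paths = write_pages(fake_pages(), Path(tmp) / "pages")
    names = [p.name for p in paths]
    first_text = paths[0].read_text(encoding="utf-8")
```

Because `write_pages` consumes an iterable, it works equally well with a generator that exports each page on demand.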

API Reference

For the complete API reference, see the API documentation.

Best Practices

1. Use Virtual Environments

Always use a virtual environment to avoid dependency conflicts.

2. Handle Errors

Always wrap conversion calls in try-except blocks.

3. Choose the Right Format

Select export formats based on your use case:

  • Markdown for documentation and human-readable output
  • JSON for programmatic processing
  • DocTags for AI/RAG systems
  • HTML for web display
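
The mapping from use case to export method can be captured in a small lookup table, which keeps the choice of format in configuration rather than scattered through the code. The sketch below is one way to do this. `EXPORT_METHODS`, `export`, and `FakeDoc` are hypothetical names, and `FakeDoc` only mimics the export method names described in this guide so the example runs on its own.

```python
from typing import Any

# Maps a format name to the export method described in this guide.
EXPORT_METHODS = {
    "markdown": "export_to_markdown",
    "json": "export_to_dict",
    "doctags": "export_to_doctags",
    "html": "export_to_html",
}


def export(doc: Any, fmt: str) -> Any:
    """Call the export method registered for fmt on doc."""
    try:
        method_name = EXPORT_METHODS[fmt.lower()]
    except KeyError:
        choices = ", ".join(sorted(EXPORT_METHODS))
        raise ValueError(f"unknown format {fmt!r}; choose from {choices}") from None
    return getattr(doc, method_name)()


class FakeDoc:
    """Stand-in exposing the same method names as the document object."""

    def export_to_markdown(self) -> str:
        return "# Title"

    def export_to_dict(self) -> dict:
        return {"name": "doc"}

    def export_to_doctags(self) -> str:
        return "<doctag></doctag>"

    def export_to_html(self) -> str:
        return "<html></html>"


md = export(FakeDoc(), "Markdown")
data = export(FakeDoc(), "json")
```

An unknown format name raises a ValueError that lists the supported choices.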

4. Optimize for Your Use Case

Use the VLM pipeline for complex documents and the default pipeline when speed matters.

Additional Resources