DocumentConverter

The main entry point for converting documents. It accepts a file path or URL and returns a ConversionResult.

Class Definition

Python
from docling.document_converter import DocumentConverter

converter = DocumentConverter(
    format=None,
    pipeline="default",
    vlm_model=None,
    ocr_enabled=True,
    ocr_language="eng"
)

Parameters

  • format (InputFormat, default: None) - Specify the input format (PDF, DOCX, etc.). Auto-detected if None.
  • pipeline (str, default: "default") - Processing pipeline to use ("default", "vlm", etc.)
  • vlm_model (str, default: None) - Visual Language Model to use (e.g., "granite_docling")
  • ocr_enabled (bool, default: True) - Enable OCR processing for scanned documents
  • ocr_language (str, default: "eng") - OCR language code

Methods

convert(source)

Convert a document from a file path or URL.

Python
result = converter.convert("document.pdf")
result = converter.convert("https://example.com/document.pdf")

Parameters:

  • source (str) - File path or URL to the document

Returns: ConversionResult object
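
Because convert() takes either a file path or a URL as a plain string, it can help to validate the source before handing it over. The sketch below is one possible pre-check using only the standard library; is_url and check_source are hypothetical helper names, not part of Docling, and the reference above does not say how convert() itself reacts to a missing file.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True when the source looks like an http(s) URL."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_source(source: str) -> str:
    """Return the source unchanged if it is a URL or an existing file.

    Raises FileNotFoundError for a local path that does not exist.
    URLs are not fetched here; only their shape is checked.
    """
    if is_url(source):
        return source
    if not Path(source).is_file():
        raise FileNotFoundError(f"No such file: {source}")
    return source
```

The checked value could then be passed on, for example converter.convert(check_source("document.pdf")).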

ConversionResult

Result object returned by DocumentConverter.convert().

Properties

  • document - DoclingDocument object containing the converted document
  • metadata - Dictionary containing conversion metadata

DoclingDocument

The unified document representation format.

Properties

  • pages - List of Page objects
  • tables - List of Table objects
  • images - List of Image objects
  • structure - Document structure information
  • metadata - Document metadata dictionary
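
The list-valued properties above lend themselves to a quick summary of what a conversion produced. This is a minimal sketch that relies only on the pages, tables, and images properties listed here; summarize is a hypothetical helper, and the SimpleNamespace object is a stand-in so the snippet runs without Docling installed.

```python
from types import SimpleNamespace


def summarize(doc) -> dict:
    """Count the pages, tables, and images on a document-like object."""
    return {
        "pages": len(doc.pages),
        "tables": len(doc.tables),
        "images": len(doc.images),
    }


# Stand-in for a DoclingDocument (made-up contents, for illustration only).
fake_doc = SimpleNamespace(pages=[object(), object()], tables=[object()], images=[])

summary = summarize(fake_doc)
print(summary)
```

With a real conversion, the same helper would be called as summarize(result.document).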

Export Methods

export_to_markdown(table_format="grid", include_images=True)

Export document to Markdown format.

Parameters:

  • table_format (str) - Table format: "grid", "pipe", or "simple"
  • include_images (bool) - Whether to include image references

Returns: str - Markdown representation
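
Since table_format accepts only the three values listed above, a thin wrapper can reject typos before the export runs. The sketch below assumes nothing beyond the documented signature; export_markdown and VALID_TABLE_FORMATS are hypothetical names, and FakeDoc is a stand-in so the example runs without Docling. How the library itself handles an unknown table_format is not stated in this reference.

```python
VALID_TABLE_FORMATS = ("grid", "pipe", "simple")


def export_markdown(doc, table_format: str = "grid", include_images: bool = True) -> str:
    """Validate table_format, then delegate to doc.export_to_markdown()."""
    if table_format not in VALID_TABLE_FORMATS:
        raise ValueError(
            f"table_format must be one of {VALID_TABLE_FORMATS}, got {table_format!r}"
        )
    return doc.export_to_markdown(
        table_format=table_format, include_images=include_images
    )


class FakeDoc:
    """Stand-in exposing the documented export_to_markdown() signature."""

    def export_to_markdown(self, table_format="grid", include_images=True):
        return f"<markdown table_format={table_format} images={include_images}>"


markdown = export_markdown(FakeDoc(), table_format="pipe")
print(markdown)
```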

export_to_html(include_styles=True, include_images=True)

Export document to HTML format.

Parameters:

  • include_styles (bool) - Include CSS styles
  • include_images (bool) - Include image references

Returns: str - HTML representation

export_to_dict()

Export document to Python dictionary (lossless JSON representation).

Returns: dict - Dictionary representation
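
If the returned dictionary is JSON-serializable, as the "lossless JSON representation" wording suggests, it can be written to disk and read back with the standard json module. The following is a sketch under that assumption; save_as_json is a hypothetical helper, and FakeDoc with its made-up dictionary stands in for a real document.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(doc, path) -> Path:
    """Write doc.export_to_dict() to path as UTF-8 JSON and return the path."""
    path = Path(path)
    payload = json.dumps(doc.export_to_dict(), indent=2, ensure_ascii=False)
    path.write_text(payload, encoding="utf-8")
    return path


class FakeDoc:
    """Stand-in exposing export_to_dict() with made-up contents."""

    def export_to_dict(self):
        return {"pages": [{"page_number": 1}], "metadata": {"title": "Example"}}


# Round-trip through a temporary file to show that nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    out = save_as_json(FakeDoc(), Path(tmp) / "document.json")
    restored = json.loads(out.read_text(encoding="utf-8"))
```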

export_to_doctags()

Export document to DocTags format for AI systems.

Returns: str - DocTags representation

Page

Represents a single page in a document.

Properties

  • page_number - Page number (1-indexed)
  • content - List of content elements
  • width - Page width
  • height - Page height
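
As an illustration of the width and height properties, the sketch below classifies each page by orientation. It only compares the two values, so it makes no assumption about their units, which this reference does not specify. orientation is a hypothetical helper, and the pages are stand-in objects with made-up dimensions.

```python
from types import SimpleNamespace


def orientation(page) -> str:
    """Classify a page-like object by comparing its width and height."""
    if page.width > page.height:
        return "landscape"
    if page.width < page.height:
        return "portrait"
    return "square"


# Stand-ins for Page objects (hypothetical dimensions).
pages = [
    SimpleNamespace(page_number=1, content=[], width=612, height=792),
    SimpleNamespace(page_number=2, content=[], width=792, height=612),
]

report = {page.page_number: orientation(page) for page in pages}
print(report)
```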

Methods

export_to_markdown()

Export page to Markdown.

Table

Represents a table extracted from a document.

Properties

  • rows - List of TableRow objects
  • columns - List of column headers

Methods

export_to_markdown()

Export table to Markdown format.

export_to_html()

Export table to HTML format.

Image

Represents an image extracted from a document.

Properties

  • filename - Image filename
  • image_type - Type of image (diagram, photo, etc.)
  • width - Image width
  • height - Image height
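
The image_type property makes it possible to separate, say, diagrams from photos after conversion. This is a minimal sketch that uses only the filename and image_type properties listed above; group_by_type is a hypothetical helper, and the image objects and their type strings are stand-ins, since the exact values Docling assigns are not listed in this reference.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(images) -> dict:
    """Map each image_type to the filenames of the images of that type."""
    groups = defaultdict(list)
    for image in images:
        groups[image.image_type].append(image.filename)
    return dict(groups)


# Stand-ins for Image objects (hypothetical filenames and types).
images = [
    SimpleNamespace(filename="fig1.png", image_type="diagram", width=800, height=600),
    SimpleNamespace(filename="cover.jpg", image_type="photo", width=1200, height=900),
    SimpleNamespace(filename="fig2.png", image_type="diagram", width=640, height=480),
]

grouped = group_by_type(images)
print(grouped)
```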

InputFormat

Enumeration of supported input formats.

Python
from docling.datamodel.base_models import InputFormat

InputFormat.PDF
InputFormat.DOCX
InputFormat.PPTX
InputFormat.XLSX
InputFormat.HTML
InputFormat.MARKDOWN
# ... and more

Exceptions

DoclingError

Base exception class for Docling-specific errors.

Python
from docling.exceptions import DoclingError

try:
    converter.convert("document.pdf")
except DoclingError as e:
    print(f"Docling error: {e}")
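
When converting several documents, the same try/except pattern can collect failures instead of stopping at the first one. The sketch below is one way to structure that; convert_all is a hypothetical helper, and FakeConverter and FakeError are stand-ins so the snippet runs without Docling. With the real library, DoclingError would be passed as error_type.

```python
def convert_all(converter, sources, error_type=Exception):
    """Convert each source, returning (results, failures) keyed by source."""
    results, failures = {}, {}
    for source in sources:
        try:
            results[source] = converter.convert(source)
        except error_type as exc:
            # Record the message and carry on with the remaining sources.
            failures[source] = str(exc)
    return results, failures


class FakeError(Exception):
    """Stand-in for DoclingError."""


class FakeConverter:
    """Stand-in converter that fails on sources ending in '.bad'."""

    def convert(self, source):
        if source.endswith(".bad"):
            raise FakeError(f"cannot convert {source}")
        return f"converted:{source}"


results, failures = convert_all(FakeConverter(), ["a.pdf", "b.bad"], FakeError)
print(results)
print(failures)
```

Passing the exception class as a parameter keeps the helper narrow: errors of other types still propagate rather than being silently recorded.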

Complete Example

Python
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.exceptions import DoclingError

# Initialize converter
converter = DocumentConverter(
    format=InputFormat.PDF,
    pipeline="default"
)

try:
    # Convert document
    result = converter.convert("document.pdf")
    doc = result.document
    
    # Access document properties
    print(f"Pages: {len(doc.pages)}")
    print(f"Tables: {len(doc.tables)}")
    print(f"Images: {len(doc.images)}")
    
    # Export to different formats
    markdown = doc.export_to_markdown()
    html = doc.export_to_html()
    json_data = doc.export_to_dict()
    
    # Work with pages
    for page in doc.pages:
        print(f"Page {page.page_number}: {len(page.content)} elements")
    
    # Work with tables
    for table in doc.tables:
        print(table.export_to_markdown())
    
except DoclingError as e:
    print(f"Error: {e}")

Additional Resources