Core Concepts
DocumentConverter
The DocumentConverter class is the main entry point for converting documents. It handles the conversion process and manages the pipeline.
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert(source)
Initialization Parameters
| Parameter | Type | Description |
|---|---|---|
| format | InputFormat | Specify the input format (PDF, DOCX, etc.) |
| pipeline | str | Processing pipeline to use (default, vlm, etc.) |
| vlm_model | str | Visual Language Model to use (e.g., granite_docling) |
Conversion Result
The convert() method returns a result object containing the converted document and metadata.
result = converter.convert("document.pdf")
doc = result.document # Access the document
metadata = result.metadata # Access metadata
DoclingDocument
The DoclingDocument is the unified representation of all converted documents, regardless of source format.
Document Properties
- pages: List of document pages
- tables: Extracted tables
- images: Document images
- structure: Document structure information
- metadata: Document metadata
Export Methods
export_to_markdown()
Export the document to Markdown format:
markdown = doc.export_to_markdown()
# Optional parameters:
markdown = doc.export_to_markdown(
    table_format="grid",  # or "pipe", "simple"
    include_images=True
)
Parameters
- table_format: Markdown table format ("grid", "pipe", "simple")
- include_images: Whether to include image references
export_to_html()
Export the document to HTML format:
html = doc.export_to_html()
# Optional parameters:
html = doc.export_to_html(
    include_styles=True,
    include_images=True
)
export_to_dict()
Export the document to a Python dictionary (lossless JSON representation):
import json
doc_dict = doc.export_to_dict()
json_str = json.dumps(doc_dict, indent=2)
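As a minimal sketch of persisting that dictionary, the helpers below write it to a JSON file and read it back. The names `save_dict` and `load_dict` and the sample dictionary are hypothetical stand-ins; in real use the dictionary would come from `doc.export_to_dict()`, and its keys will differ from the illustrative ones here.

```python
import json
import tempfile
from pathlib import Path


def save_dict(doc_dict: dict, path: Path) -> None:
    """Write an exported document dictionary to a UTF-8 JSON file."""
    # ensure_ascii=False keeps non-ASCII text readable in the file.
    text = json.dumps(doc_dict, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")


def load_dict(path: Path) -> dict:
    """Read a previously saved document dictionary back from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


# Stand-in for doc.export_to_dict(); these keys are illustrative only.
sample_dict = {"name": "document", "texts": [{"text": "Héllo"}]}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "document.json"
    save_dict(sample_dict, out_path)
    restored = load_dict(out_path)
```

Writing with an explicit encoding avoids platform-dependent defaults when the document contains non-ASCII text.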
export_to_doctags()
Export the document to DocTags format for AI systems:
doctags = doc.export_to_doctags()
Configuration
Basic Configuration
Configure the converter with various options:
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
converter = DocumentConverter(
    format=InputFormat.PDF,
    pipeline="default"
)
VLM Pipeline
Use Visual Language Models for enhanced document understanding:
converter = DocumentConverter(
    pipeline="vlm",
    vlm_model="granite_docling"
)
result = converter.convert("document.pdf")
On Apple Silicon, MLX acceleration is automatically used when available.
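To check whether a machine is likely to qualify, a rough sketch using only the standard library is shown below. The function names are hypothetical, and the module name `mlx` is an assumption about which package provides MLX support in your environment; pass a different name if yours differs.

```python
import importlib.util
import platform


def is_apple_silicon() -> bool:
    """Return True when Python is running natively on an Apple Silicon Mac."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def mlx_available(module_name: str = "mlx") -> bool:
    """Return True when the given top-level module can be found (assumed name: 'mlx')."""
    return importlib.util.find_spec(module_name) is not None


print(f"Apple Silicon: {is_apple_silicon()}, MLX importable: {mlx_available()}")
```

A Python interpreter running under Rosetta reports `x86_64`, so this check returns False there even on Apple hardware.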
OCR Configuration
Configure OCR settings for scanned documents:
converter = DocumentConverter(
    ocr_enabled=True,
    ocr_language="eng"  # Language code
)
Working with Document Components
Accessing Pages
doc = result.document
# Iterate through pages
for page in doc.pages:
    print(f"Page {page.page_number}: {len(page.content)} elements")
    print(page.export_to_markdown())
Working with Tables
# Access all tables
for table in doc.tables:
    print(f"Table with {len(table.rows)} rows")
    print(table.export_to_markdown())
    # Access table data
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
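The same row and cell traversal can feed other formats. The sketch below serializes a table to CSV with the standard `csv` module. The `Cell`, `Row`, and `Table` dataclasses are hypothetical stand-ins that only mimic the `rows`, `cells`, and `text` attributes used above, so the example runs without a converted document.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Cell:
    text: str


@dataclass
class Row:
    cells: list = field(default_factory=list)


@dataclass
class Table:
    rows: list = field(default_factory=list)


def table_to_csv(table) -> str:
    """Serialize a table's cell text to CSV, one line per table row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table.rows:
        writer.writerow([cell.text for cell in row.cells])
    return buffer.getvalue()


# Stand-in table; a real one would come from doc.tables.
demo = Table(rows=[
    Row(cells=[Cell("Name"), Cell("Qty")]),
    Row(cells=[Cell("Widget, large"), Cell("3")]),
])
csv_text = table_to_csv(demo)
```

Using `csv.writer` rather than joining strings by hand means cells that contain commas or quotes are escaped correctly.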
Working with Images
# Access images
for image in doc.images:
    print(f"Image: {image.filename}")
    print(f"Type: {image.image_type}")
    print(f"Size: {image.width}x{image.height}")
Error Handling
Handle errors gracefully in your applications:
from docling.document_converter import DocumentConverter
from docling.exceptions import DoclingError
try:
    converter = DocumentConverter()
    result = converter.convert("document.pdf")
except DoclingError as e:
    print(f"Error converting document: {e}")
except FileNotFoundError:
    print("Document file not found")
except Exception as e:
    print(f"Unexpected error: {e}")
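When many call sites need the same handling, one option is to wrap the call in a small helper that returns the error instead of printing it. This is only a sketch: `safe_convert` and `fake_convert` are hypothetical names, and `fake_convert` stands in for `converter.convert` so the example runs on its own.

```python
from typing import Any, Callable, Optional, Tuple


def safe_convert(
    convert: Callable[[str], Any], path: str
) -> Tuple[Optional[Any], Optional[Exception]]:
    """Call convert(path) and return (result, None) on success or (None, error)."""
    try:
        return convert(path), None
    except Exception as exc:  # broad on purpose; narrow to specific types as needed
        return None, exc


def fake_convert(path: str) -> str:
    """Stand-in converter: accepts only .pdf paths."""
    if not path.endswith(".pdf"):
        raise ValueError(f"unsupported file: {path}")
    return f"converted:{path}"


ok_result, ok_error = safe_convert(fake_convert, "document.pdf")
bad_result, bad_error = safe_convert(fake_convert, "notes.xyz")
```

Returning the exception lets the caller decide whether to log, retry, or skip the document.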
Performance Optimization
Batch Processing
Convert multiple documents concurrently with a thread pool:
from concurrent.futures import ThreadPoolExecutor
from docling.document_converter import DocumentConverter
def convert_document(path):
    converter = DocumentConverter()
    return converter.convert(path)

documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(convert_document, documents))
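With `executor.map`, an exception from one document is raised while the results are being collected, so the results that follow it are never returned. A possible variant, sketched below, records a result or an exception per path. `convert_many` and `fake_convert` are hypothetical names; `fake_convert` stands in for a real conversion function such as `convert_document`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable


def convert_many(
    convert: Callable[[str], Any], paths: Iterable[str], max_workers: int = 4
) -> Dict[str, Any]:
    """Convert paths concurrently; map each path to its result or its exception."""
    outcomes: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                outcomes[path] = future.result()
            except Exception as exc:
                # Keep going: one failed document should not hide the others.
                outcomes[path] = exc
    return outcomes


def fake_convert(path: str) -> str:
    """Stand-in converter that fails for one specific input."""
    if path == "doc2.pdf":
        raise RuntimeError("simulated conversion failure")
    return f"converted:{path}"


outcomes = convert_many(fake_convert, ["doc1.pdf", "doc2.pdf", "doc3.pdf"])
failed = sorted(p for p, r in outcomes.items() if isinstance(r, Exception))
```

Results arrive in completion order, which is why they are keyed by path rather than stored in a list.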
Caching
Models are cached automatically after the first download. To clear the cache, delete the cache directory:
# Cache location is typically in:
# ~/.cache/docling/models/
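Deleting the directory can be scripted with the standard library. The sketch below uses a hypothetical helper, `clear_cache`, and demonstrates it on a temporary directory so it is safe to run. Whether your models actually live at the location above is an assumption to verify before pointing it at a real path.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir: Path) -> bool:
    """Delete cache_dir and everything under it; return True if it existed."""
    cache_dir = cache_dir.expanduser()
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstrated on a throwaway directory that mimics the cache layout.
with tempfile.TemporaryDirectory() as tmp:
    fake_cache = Path(tmp) / "docling" / "models"
    fake_cache.mkdir(parents=True)
    (fake_cache / "weights.bin").write_bytes(b"\x00")
    removed = clear_cache(fake_cache)
    still_there = fake_cache.exists()
    removed_again = clear_cache(fake_cache)
```

For the location mentioned above, the call would be `clear_cache(Path("~/.cache/docling/models"))`; models are then downloaded again on next use.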
Advanced Usage
Custom Processing Pipeline
Create custom processing pipelines for specific use cases:
# Custom configuration
converter = DocumentConverter(
    pipeline="custom",
    # Add custom pipeline configuration
)
Streaming Large Documents
For very large documents, export and handle the converted output one page at a time instead of exporting the whole document at once:
converter = DocumentConverter()
result = converter.convert("large_document.pdf")
doc = result.document
# Process pages incrementally
for page in doc.pages:
    process_page(page.export_to_markdown())
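The loop above can also be wrapped in a generator so that downstream code pulls one page's Markdown at a time. This is a sketch under assumptions: `FakePage` is a hypothetical stand-in exposing the `page_number` and `export_to_markdown()` members used in this guide, and `iter_page_markdown` is an illustrative name.

```python
from typing import Iterable, Iterator, Tuple


class FakePage:
    """Stand-in exposing page_number and export_to_markdown()."""

    def __init__(self, page_number: int, text: str) -> None:
        self.page_number = page_number
        self._text = text

    def export_to_markdown(self) -> str:
        return self._text


def iter_page_markdown(pages: Iterable) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, markdown) lazily, exporting one page per step."""
    for page in pages:
        yield page.page_number, page.export_to_markdown()


pages = [FakePage(1, "# Title"), FakePage(2, "Body text")]
collected = list(iter_page_markdown(pages))
```

Only the exported Markdown is produced lazily here; the converted document itself is still held in memory.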
API Reference
For complete API reference, see the API documentation.
Best Practices
1. Use Virtual Environments
Always use a virtual environment to avoid dependency conflicts.
2. Handle Errors
Always wrap conversion calls in try-except blocks.
3. Choose the Right Format
Select export formats based on your use case:
- Markdown for documentation and human-readable output
- JSON for programmatic processing
- DocTags for AI/RAG systems
- HTML for web display
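The mapping above can be captured in a small dispatch table. In this sketch the use-case keys, `export_for`, and `FakeDoc` are hypothetical; the method names are the four export methods described in this guide, and `FakeDoc` stands in for a converted document so the example runs on its own.

```python
# Use case -> name of the export method to call.
EXPORTERS = {
    "docs": "export_to_markdown",
    "data": "export_to_dict",
    "rag": "export_to_doctags",
    "web": "export_to_html",
}


def export_for(doc, use_case: str):
    """Call the export method that matches the given use case."""
    try:
        method_name = EXPORTERS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
    return getattr(doc, method_name)()


class FakeDoc:
    """Stand-in document implementing the four export methods."""

    def export_to_markdown(self):
        return "# md"

    def export_to_dict(self):
        return {"kind": "dict"}

    def export_to_doctags(self):
        return "<doctag>"

    def export_to_html(self):
        return "<html></html>"


markdown_out = export_for(FakeDoc(), "docs")
dict_out = export_for(FakeDoc(), "data")
```

Keeping the mapping in one place makes it easy to add a use case without touching the calling code.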
4. Optimize for Your Use Case
Use the VLM pipeline for complex documents and the default pipeline when speed matters most.