Quick Start

In just a few lines of code, you can convert documents into structured formats ready for AI processing.

Basic Example

Here's the simplest way to convert a document:

Python
from docling.document_converter import DocumentConverter

# Convert a document from URL
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

That's it! You've converted a PDF to Markdown. The document is now ready for further processing.

Convert Local Files

You can also convert local files:

Python
from docling.document_converter import DocumentConverter

# Convert a local file
converter = DocumentConverter()
result = converter.convert("path/to/your/document.pdf")
print(result.document.export_to_markdown())

Understanding the Result

The convert() method returns a result object containing the converted document. Let's explore what you can do with it:

Python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Export to different formats
markdown = doc.export_to_markdown()
html = doc.export_to_html()
json_data = doc.export_to_dict()

# Access document properties
print(f"Number of pages: {len(doc.pages)}")
print(f"Number of text items: {len(doc.texts)}")

Export Formats

Docling supports multiple export formats. Choose the one that best fits your use case:

Markdown

Perfect for documentation, content management, and human-readable output:

Python
markdown = doc.export_to_markdown()
print(markdown)

HTML

Rich HTML output with styling, suitable for web display:

Python
html = doc.export_to_html()
print(html)

JSON

Lossless JSON representation preserving all structure and metadata:

Python
import json
json_data = doc.export_to_dict()
print(json.dumps(json_data, indent=2))
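
To keep the JSON export for later use, it can be written to disk and read back with the standard library. The sketch below is a minimal example, not part of Docling's API: `save_json` and `load_json` are hypothetical helper names, and `sample` is a made-up stand-in for the dictionary returned by `doc.export_to_dict()`.

```python
import json
import tempfile
from pathlib import Path


def save_json(data: dict, path) -> Path:
    """Write a dictionary to a UTF-8 JSON file and return the file path."""
    path = Path(path)
    # ensure_ascii=False keeps non-ASCII document text readable in the file
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def load_json(path) -> dict:
    """Read a JSON file back into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up stand-in for: json_data = doc.export_to_dict()
sample = {"name": "example", "texts": [{"text": "Caf\u00e9 menu"}]}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_json(sample, Path(tmp) / "document.json")
    restored = load_json(saved_path)

print(restored == sample)
```

In real use, `sample` would be replaced by the exported dictionary and the temporary directory by a permanent output location.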

DocTags

Structured document tags format for AI and RAG systems:

Python
doctags = doc.export_to_doctags()
print(doctags)

Learn more about export options in the documentation.

Working with Different Document Types

PDF Documents

PDFs are Docling's specialty. Advanced features include:

  • Layout detection and reading order
  • Table extraction
  • Formula recognition
  • Code block detection
  • Image classification

Python
# PDF with advanced processing
converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Access tables (passing the document lets each table resolve its own content)
for table in doc.tables:
    print(f"Table: {table.export_to_markdown(doc=doc)}")

Word Documents

Process Microsoft Word documents:

Python
converter = DocumentConverter()
result = converter.convert("document.docx")
print(result.document.export_to_markdown())

Images

Extract text from images using OCR:

Python
converter = DocumentConverter()
result = converter.convert("image.png")
print(result.document.export_to_markdown())

Audio Files

Transcribe audio files with ASR models:

Python
converter = DocumentConverter()
result = converter.convert("audio.mp3")
print(result.document.export_to_markdown())

See all supported formats.

Configuration Options

Customize Docling's behavior with configuration options:

Python
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

# Configure the converter to accept only PDF input
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    # Add more configuration options here
)

result = converter.convert("document.pdf")

Learn more about configuration options.
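
As one possible configuration, the sketch below restricts the converter to PDF input and adjusts the PDF pipeline through `format_options`. The helper name `build_pdf_converter` and the chosen option values are illustrative assumptions. The imports sit inside the function only so that the snippet can be loaded without Docling installed.

```python
def build_pdf_converter(do_ocr: bool = False, do_table_structure: bool = True):
    """Build a DocumentConverter that only accepts PDFs (illustrative helper)."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Pipeline options control which processing steps run on each PDF
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = do_ocr
    pipeline_options.do_table_structure = do_table_structure

    return DocumentConverter(
        allowed_formats=[InputFormat.PDF],
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        },
    )
```

Calling `build_pdf_converter(do_ocr=False).convert("document.pdf")` would then convert a PDF with OCR switched off, which can be faster for documents that already contain a text layer.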

Using the CLI

You can also use Docling from the command line without writing any Python code:

Terminal
# Convert a document from URL
docling https://arxiv.org/pdf/2206.01062

# Convert a local file
docling document.pdf

# Use VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf

Learn more about the CLI.

Integration with AI Frameworks

Docling integrates seamlessly with popular AI frameworks:

LangChain

Python
# Requires the langchain-docling package
from langchain_docling import DoclingLoader

loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

LlamaIndex

Python
from llama_index.readers.docling import DoclingReader

reader = DoclingReader()
documents = reader.load_data("document.pdf")

Explore all available integrations.

Next Steps

Now that you've converted your first document, the common use cases below are a good place to explore next.

Common Use Cases

RAG Applications

Prepare documents for Retrieval-Augmented Generation:

Python
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()

# Use with your RAG system
# ... process markdown for embeddings ...
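
The "process markdown for embeddings" step usually starts by splitting the text into chunks. A minimal sketch, assuming a naive split at Markdown headings is good enough: `split_markdown_by_heading` is a hypothetical helper, not Docling's own chunking support, and it does not treat `#` lines inside code fences specially.

```python
import re

# A Markdown heading: one to six '#' characters followed by whitespace
HEADING = re.compile(r"^#{1,6}\s")


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A heading closes the chunk collected so far
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace
    return [chunk for chunk in chunks if chunk]


sample = "# Title\n\nIntro text.\n\n## Section A\n\nBody A.\n\n## Section B\n\nBody B."
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk keeps its heading, so the text passed to an embedding model carries some context about where it came from.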

Document Analysis

Extract and analyze document structure:

Python
converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Analyze document structure
print(f"Pages: {len(doc.pages)}")
print(f"Tables: {len(doc.tables)}")
print(f"Pictures: {len(doc.pictures)}")

Batch Processing

Process multiple documents:

Python
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

for doc_path in documents:
    result = converter.convert(doc_path)
    # Write doc1.pdf to doc1.md, and so on
    output_path = Path(doc_path).with_suffix(".md")
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(result.document.export_to_markdown())
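
In a larger batch, one unreadable file should not stop the whole run. The sketch below is one way to handle that, under stated assumptions: `convert_batch` is a hypothetical helper, and `StubConverter` is a stand-in that only mimics the `convert(...).document.export_to_markdown()` call chain used above, so the example runs without Docling. In real use, a `DocumentConverter` instance would be passed instead.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def convert_batch(converter, sources, out_dir):
    """Convert each source to Markdown, collecting failures instead of stopping."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for source in sources:
        try:
            result = converter.convert(source)
        except Exception as exc:
            # Record the failure and continue with the next file
            failed.append((source, str(exc)))
            continue
        target = out_dir / f"{Path(source).stem}.md"
        target.write_text(result.document.export_to_markdown(), encoding="utf-8")
        written.append(target)
    return written, failed


class StubConverter:
    """Stand-in for DocumentConverter (assumption, for demonstration only)."""

    def convert(self, source):
        if source.endswith("broken.pdf"):
            raise ValueError(f"cannot read {source}")
        document = SimpleNamespace(export_to_markdown=lambda: f"# {source}\n")
        return SimpleNamespace(document=document)


with tempfile.TemporaryDirectory() as tmp:
    written, failed = convert_batch(
        StubConverter(), ["doc1.pdf", "broken.pdf", "doc2.pdf"], tmp
    )
    written_names = [path.name for path in written]
    first_text = written[0].read_text(encoding="utf-8")

print(written_names)
print(failed)
```

Output names are derived from the file stem, so two sources with the same stem in different folders would overwrite each other; a real pipeline may need a more careful naming scheme.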