Concepts¶

Understand the fundamental concepts of DataFrameIt.

Architecture¶

┌─────────────────────────────────────────────────────────────┐
│                        dataframeit()                         │
├─────────────────────────────────────────────────────────────┤
│  Input             │  Processing         │  Output          │
│  ─────             │  ──────────         │  ──────          │
│  • DataFrame       │  • For each row:    │  • DataFrame     │
│  • Series          │    1. Build prompt  │    with extracted│
│  • List            │    2. Call LLM      │    columns       │
│  • Dict            │    3. Validate resp.│                  │
│                    │    4. Retry on error│                  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     LangChain + Provider                     │
├─────────────────────────────────────────────────────────────┤
│  Google Gemini │ OpenAI │ Anthropic │ Cohere │ Mistral      │
└─────────────────────────────────────────────────────────────┘

Main Components¶

1. Pydantic Model¶

The Pydantic model defines what you want to extract. Each field becomes a column in the output DataFrame.

from pydantic import BaseModel, Field
from typing import Literal, Optional

class Analysis(BaseModel):
    # Required field with fixed values
    category: Literal['A', 'B', 'C'] = Field(
        description="Item category"
    )

    # Required field with free text
    summary: str = Field(
        description="Summary in one sentence"
    )

    # Optional field
    notes: Optional[str] = Field(
        default=None,
        description="Additional notes, if any"
    )

Why Pydantic?

Automatic validation: The LLM is forced to return data in the correct format
Documentation: Field descriptions help the LLM understand what to extract
Type safety: Type errors are caught automatically

2. Prompt Template¶

The prompt defines how the LLM should process each text.

# Simple - text is automatically added at the end
PROMPT = "Classify the sentiment of the text."

# With placeholder - control where text appears
PROMPT = """
You are a specialized analyst.

Document:
{texto}

Extract the requested information from the document above.
"""

3. Providers via LangChain¶

DataFrameIt uses LangChain to abstract different LLM providers:

Provider	Popular Models	Environment Variable
`google_genai`	gemini-3-flash-preview, gemini-2.5-flash, gemini-2.5-pro	`GOOGLE_API_KEY`
`openai`	gpt-5.2, gpt-5.2-mini, gpt-4.1	`OPENAI_API_KEY`
`anthropic`	claude-sonnet-4-5, claude-opus-4-6, claude-haiku-4-5	`ANTHROPIC_API_KEY`

Processing Flow¶

For each DataFrame row:
│
├─► 1. Build prompt (template + row text)
│
├─► 2. Send to LLM via LangChain
│
├─► 3. Receive structured response
│
├─► 4. Validate with Pydantic
│   │
│   ├─► Success: mark as 'processed'
│   │
│   └─► Error: retry with exponential backoff
│       │
│       ├─► Success after retry: mark as 'processed'
│       │
│       └─► Failure after max_retries: mark as 'error'
│
└─► 5. Add extracted fields to DataFrame

Automatic Columns¶

DataFrameIt automatically adds control columns:

Column	Description
`_dataframeit_status`	Status: `'processed'`, `'error'`, or `None`
`_error_details`	Error details (when status is `'error'`)
`_input_tokens`	Input tokens (with `track_tokens=True`)
`_output_tokens`	Output tokens (with `track_tokens=True`)

Next Steps¶

Basic Usage: Practical examples
Error Handling: Configure retry and fallbacks
Performance: Parallelism and rate limiting