
Concepts

Understand the fundamental concepts of DataFrameIt.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        dataframeit()                         │
├─────────────────────────────────────────────────────────────┤
│  Input             │  Processing         │  Output          │
│  ─────             │  ──────────         │  ──────          │
│  • DataFrame       │  • For each row:    │  • DataFrame     │
│  • Series          │    1. Build prompt  │    with extracted│
│  • List            │    2. Call LLM      │    columns       │
│  • Dict            │    3. Validate resp.│                  │
│                    │    4. Retry on error│                  │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                     LangChain + Provider                     │
├─────────────────────────────────────────────────────────────┤
│  Google Gemini │ OpenAI │ Anthropic │ Cohere │ Mistral      │
└─────────────────────────────────────────────────────────────┘

Main Components

1. Pydantic Model

The Pydantic model defines what you want to extract. Each field becomes a column in the output DataFrame.

from pydantic import BaseModel, Field
from typing import Literal, Optional

class Analysis(BaseModel):
    # Required field with fixed values
    category: Literal['A', 'B', 'C'] = Field(
        description="Item category"
    )

    # Required field with free text
    summary: str = Field(
        description="Summary in one sentence"
    )

    # Optional field
    notes: Optional[str] = Field(
        default=None,
        description="Additional notes, if any"
    )

Why Pydantic?

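  • Automatic validation: Each response is checked against the model, so output in the wrong format is rejected instead of being written to the DataFrame
  • Documentation: Field descriptions help the LLM understand what to extract
  • Type safety: A value of the wrong type fails validation instead of passing through silently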
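
The points above can be checked with Pydantic directly. The following is a minimal sketch, assuming Pydantic v2 (`model_validate`, `model_fields`); it repeats the `Analysis` model so that it runs on its own, and it does not show how DataFrameIt calls Pydantic internally.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class Analysis(BaseModel):
    category: Literal['A', 'B', 'C'] = Field(description="Item category")
    summary: str = Field(description="Summary in one sentence")
    notes: Optional[str] = Field(default=None, description="Additional notes, if any")


# The field names are what end up as output columns.
columns = list(Analysis.model_fields)

# A well-formed response passes; the optional field falls back to its default.
ok = Analysis.model_validate({"category": "A", "summary": "Short and clear."})

# A category outside the allowed values is rejected.
try:
    Analysis.model_validate({"category": "Z", "summary": "Invalid category."})
    rejected = False
    bad_fields = []
except ValidationError as exc:
    rejected = True
    bad_fields = [err["loc"][0] for err in exc.errors()]
```

After this runs, `columns` is `['category', 'summary', 'notes']`, `ok.notes` is `None`, and `bad_fields` names `category` as the field that failed.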

2. Prompt Template

The prompt defines how the LLM should process each text.

# Simple - text is automatically added at the end
PROMPT = "Classify the sentiment of the text."

# With placeholder - control where text appears
PROMPT = """
You are a specialized analyst.

Document:
{texto}

Extract the requested information from the document above.
"""

3. Providers via LangChain

DataFrameIt uses LangChain to abstract different LLM providers:

Provider      Popular Models                                            Environment Variable
────────      ──────────────                                            ────────────────────
google_genai  gemini-3-flash-preview, gemini-2.5-flash, gemini-2.5-pro  GOOGLE_API_KEY
openai        gpt-5.2, gpt-5.2-mini, gpt-4.1                            OPENAI_API_KEY
anthropic     claude-sonnet-4-5, claude-opus-4-6, claude-haiku-4-5      ANTHROPIC_API_KEY
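
Since each provider reads its key from an environment variable, it can help to check that the variable is set before starting a long run. The sketch below uses only the standard library; `API_KEY_VARS` and `missing_api_key` are hypothetical names for illustration, not part of DataFrameIt.

```python
import os
from typing import Mapping, Optional

# Provider -> environment variable, copied from the table above.
API_KEY_VARS = {
    "google_genai": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_api_key(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset variable for `provider`, or None if it is set."""
    var = API_KEY_VARS[provider]
    return None if env.get(var) else var


# Example: fail early with a readable message.
# if (var := missing_api_key("openai")) is not None:
#     raise RuntimeError(f"Set {var} before running")
```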

Processing Flow

For each DataFrame row:
├─► 1. Build prompt (template + row text)
├─► 2. Send to LLM via LangChain
├─► 3. Receive structured response
├─► 4. Validate with Pydantic
│   │
│   ├─► Success: mark as 'processed'
│   │
│   └─► Error: retry with exponential backoff
│       │
│       ├─► Success after retry: mark as 'processed'
│       │
│       └─► Failure after max_retries: mark as 'error'
└─► 5. Add extracted fields to DataFrame
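
The retry branch of the flow above can be sketched in plain Python. This is an illustration under stated assumptions, not DataFrameIt's implementation: `process_row`, `extract`, `base_delay`, and the doubling schedule are hypothetical, and `max_retries` is taken to mean the number of retries after the first attempt.

```python
import time
from typing import Any, Callable, Dict


def process_row(
    extract: Callable[[], Dict[str, Any]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Run steps 2-4 for one row, retrying failures with exponential backoff."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            fields = extract()  # stands in for "call LLM + validate with Pydantic"
            return {**fields, "_dataframeit_status": "processed", "_error_details": None}
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: record the last error instead of raising.
    return {"_dataframeit_status": "error", "_error_details": str(last_error)}
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in normal use the default `time.sleep` applies.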

Automatic Columns

DataFrameIt automatically adds control columns:

Column               Description
──────               ───────────
_dataframeit_status  Status: 'processed', 'error', or None
_error_details       Error details (when status is 'error')
_input_tokens        Input tokens (with track_tokens=True)
_output_tokens       Output tokens (with track_tokens=True)
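
These columns make it straightforward to inspect a run with ordinary pandas operations. The sketch below uses a hand-built DataFrame with made-up values as a stand-in for a real result, so only the control column names come from the table above.

```python
import pandas as pd

# Stand-in for a DataFrameIt result; every value here is invented for illustration.
result = pd.DataFrame({
    "text": ["Great product", "Terrible support", "No status yet"],
    "category": ["A", None, None],
    "_dataframeit_status": ["processed", "error", None],
    "_error_details": [None, "validation failed", None],
    "_input_tokens": [120, 95, None],
    "_output_tokens": [18, 0, None],
})

status = result["_dataframeit_status"]
done = result[status == "processed"]   # rows with extracted fields
failed = result[status == "error"]     # rows to inspect via _error_details
no_status = result[status.isna()]      # rows whose status is None

# Missing token counts are skipped by sum(), so partial runs still add up.
total_tokens = int(result[["_input_tokens", "_output_tokens"]].sum().sum())
```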

Next Steps