Web Search¶
Enrich your data with web search using Tavily.
Overview¶
DataFrameIt can search the web for information to complement the analysis of each text. This is useful when you need additional context not present in the original text.
Setup¶
1. Install Dependency¶
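The install command was not preserved on this page. Assuming the package is published on PyPI under the name used in the imports below (verify the exact package name and extras in the project README):

```shell
# Package name assumed from `from dataframeit import dataframeit` below;
# tavily-python is Tavily's official Python client.
pip install dataframeit tavily-python
```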
2. Configure API Key¶
Get your key at: Tavily
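Tavily clients conventionally read the key from the `TAVILY_API_KEY` environment variable; assuming dataframeit follows that convention:

```shell
# Replace with your real key from the Tavily dashboard
export TAVILY_API_KEY="tvly-your-key-here"
```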
Basic Usage¶
```python
from pydantic import BaseModel, Field
from typing import Literal
import pandas as pd
from dataframeit import dataframeit

class CompanyInfo(BaseModel):
    sector: Literal['technology', 'health', 'finance', 'retail', 'other']
    description: str = Field(description="Brief company description")
    founded: str = Field(description="Year founded, if found")

# Data with company names
df = pd.DataFrame({
    'text': ['Microsoft', 'Stripe', 'DoorDash']
})

PROMPT = """
Based on available information and web search,
extract information about the mentioned company.
"""

# Enable web search with use_search=True
result = dataframeit(
    df,
    CompanyInfo,
    PROMPT,
    text_column='text',
    use_search=True,  # Enable web search
    max_results=5     # Number of results per search
)
```
Search Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_search` | bool | `False` | Enable web search via Tavily |
| `search_per_field` | bool | `False` | Execute a separate search for each model field |
| `max_results` | int | `5` | Results per search (1–20) |
| `search_depth` | str | `'basic'` | `'basic'` (1 credit) or `'advanced'` (2 credits) |
| `save_trace` | bool/str | `None` | Save agent trace: `True`/`"full"` or `"minimal"` |
Examples¶
Basic Search¶
The simplest configuration is `use_search=True` with default parameters, as shown in the Basic Usage example above.
Search per Field¶
When the model has many fields, it can be useful to run a separate search for each one:
```python
result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True,
    search_per_field=True  # One search per model field
)
```
Deep Search¶
```python
# More detailed search (slower, more expensive)
result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True,
    search_depth='advanced',
    max_results=10
)
```
Debug: Save Agent Trace (v0.5.3+)¶
To debug and audit agent reasoning, use the save_trace parameter.
Parameters¶
| Value | Description |
|---|---|
| `False` / `None` | Disabled (default) |
| `True` / `"full"` | Complete trace with message content |
| `"minimal"` | Only queries and counts, without search result content |
Generated Columns¶
- Single agent: `_trace`
- Per-field: `_trace_{field_name}` for each field
Trace Structure¶
```json
{
  "messages": [
    {"type": "human", "content": "Analyze the medication..."},
    {"type": "ai", "content": "", "tool_calls": [...]},
    {"type": "tool", "content": "[search results]", "tool_call_id": "..."}
  ],
  "search_queries": ["query1", "query2"],
  "total_tool_calls": 2,
  "duration_seconds": 3.45,
  "model": "gpt-4o-mini"
}
```
Example: Full Trace¶
```python
import json

result = dataframeit(
    df,
    MedicationInfo,
    PROMPT,
    use_search=True,
    save_trace=True  # or "full"
)

# Access trace from first row
trace = json.loads(result['_trace'].iloc[0])
print(f"Queries performed: {trace['search_queries']}")
print(f"Duration: {trace['duration_seconds']}s")
print(f"Model: {trace['model']}")
```
Example: Minimal Trace¶
For audits where only the search queries matter:
```python
result = dataframeit(
    df, Model, PROMPT,
    use_search=True,
    save_trace="minimal"  # Excludes search result content
)
```
Example: Per-Field Trace¶
```python
result = dataframeit(
    df,
    MedicationInfo,
    PROMPT,
    use_search=True,
    search_per_field=True,
    save_trace="full"
)

# Each field has its own trace
trace_ingredient = json.loads(result['_trace_active_ingredient'].iloc[0])
trace_indication = json.loads(result['_trace_indication'].iloc[0])
```
Search Groups (v0.5.3+)¶
When multiple fields need the same search context, you can group them to reduce redundant API calls.
Motivation¶
Without groups, if you have 6 fields with search_per_field=True, 6 searches are made per row. With groups, related fields share a single search.
Example:
- Fields fda_status, ema_approval, clinical_trials are all about regulation
- Without groups: 3 separate searches (redundant)
- With groups: 1 shared search (efficient)
Group Parameters¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `fields` | list | Yes | List of fields belonging to the group |
| `prompt` | str | No | Custom prompt for the group; use `{query}` for the text |
| `max_results` | int | No | Results override (1–20) |
| `search_depth` | str | No | Override: `"basic"` or `"advanced"` |
Basic Example¶
```python
from pydantic import BaseModel, Field

class DrugRegulatory(BaseModel):
    # Group "regulatory" fields (1 shared search)
    fda_status: str = Field(description="FDA approval status")
    ema_approval: str = Field(description="EMA approval status")
    clinical_trials: str = Field(description="Ongoing clinical trials")

    # Isolated fields (1 search each)
    name: str = Field(description="Drug name")
    manufacturer: str = Field(description="Manufacturer")

result = dataframeit(
    df,
    DrugRegulatory,
    "Research the drug: {text}",
    text_column='text',
    use_search=True,
    search_per_field=True,
    search_groups={
        "regulatory": {
            "fields": ["fda_status", "ema_approval", "clinical_trials"],
            "prompt": "Search regulatory status (FDA, EMA, clinical trials) for: {query}",
            "search_depth": "advanced",
        }
    }
)
```
Result:
- Before: 5 searches (1 per field)
- After: 3 searches (1 for the group + 2 isolated)
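The count above follows from a simple rule: one search per group plus one per field outside any group. A small sketch of that arithmetic (`searches_per_row` is a hypothetical helper for illustration, not part of the dataframeit API):

```python
def searches_per_row(model_fields, search_groups):
    """Per-row search count when search_per_field=True:
    one search per group plus one per ungrouped field."""
    grouped = {f for g in search_groups.values() for f in g["fields"]}
    isolated = [f for f in model_fields if f not in grouped]
    return len(search_groups) + len(isolated)

fields = ["fda_status", "ema_approval", "clinical_trials", "name", "manufacturer"]
groups = {"regulatory": {"fields": ["fda_status", "ema_approval", "clinical_trials"]}}

print(searches_per_row(fields, {}))      # 5 (no groups: one per field)
print(searches_per_row(fields, groups))  # 3 (1 group + 2 isolated)
```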
Multiple Groups¶
```python
search_groups={
    "regulatory": {
        "fields": ["fda_status", "ema_approval"],
        "prompt": "Search regulatory status: {query}",
    },
    "clinical": {
        "fields": ["efficacy", "safety"],
        "prompt": "Search clinical studies about: {query}",
        "search_depth": "advanced",
    }
}
```
Traces with Groups¶
With save_trace=True, traces are organized by group:
```python
result = dataframeit(
    df, Model, PROMPT,
    use_search=True,
    search_per_field=True,
    search_groups={"regulatory": {"fields": ["fda_status", "ema_approval"]}},
    save_trace=True
)

# Group trace
trace_regulatory = json.loads(result['_trace_regulatory'].iloc[0])

# Isolated field traces
trace_name = json.loads(result['_trace_name'].iloc[0])
```
Validation Rules¶
- Requires `use_search=True` and `search_per_field=True`
- Fields must exist in the Pydantic model
- Fields cannot be in multiple groups
- Fields in groups cannot have `json_schema_extra` search configuration: choose between per-field or group configuration, not both
Use Case: Fact Checking¶
```python
from pydantic import BaseModel, Field
from typing import Literal, List

class FactCheck(BaseModel):
    claim: str = Field(description="The original claim")
    verdict: Literal['true', 'false', 'partially_true', 'inconclusive']
    sources: List[str] = Field(description="Sources supporting the verdict")
    explanation: str = Field(description="Explanation of the verdict")

PROMPT = """
Verify the truthfulness of the claim using web search information.
Cite the sources found.
"""

result = dataframeit(
    df_claims,
    FactCheck,
    PROMPT,
    text_column='text',
    use_search=True,
    max_results=5,
    search_depth='advanced'
)
```
Use Case: Lead Enrichment¶
```python
from pydantic import BaseModel, Field
from typing import Literal, List

class EnrichedLead(BaseModel):
    company: str
    website: str = Field(description="Official website")
    linkedin: str = Field(description="LinkedIn URL")
    size: Literal['startup', 'sme', 'enterprise']
    technologies: List[str] = Field(description="Technologies used")

result = dataframeit(
    df_leads,
    EnrichedLead,
    "Research information about the company.",
    text_column='text',
    use_search=True,
    max_results=3
)
```
Costs and Limits¶
Watch costs
Each DataFrame row makes a web search. For large datasets, this can generate significant costs on the Tavily API.
- Free tier: 1000 searches/month
- Basic search: ~$0.01 per search
- Advanced search: ~$0.02 per search
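Before a large run, it can help to estimate the bill up front from the approximate prices above. A minimal sketch (`estimate_cost` is an illustrative helper, not a dataframeit function, and real Tavily pricing may differ):

```python
def estimate_cost(n_rows, n_fields=1, per_field=False, depth="basic"):
    """Rough Tavily cost estimate: ~$0.01 per basic search,
    ~$0.02 per advanced search (approximate prices from above)."""
    price = 0.02 if depth == "advanced" else 0.01
    searches = n_rows * (n_fields if per_field else 1)
    return searches, searches * price

searches, cost = estimate_cost(n_rows=10_000, n_fields=4, per_field=True)
print(f"{searches} searches, ~${cost:.2f}")  # 40000 searches, ~$400.00
```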
Tips to Save¶
- Use `max_results=3` to `5` (enough for most cases)
- Prefer `search_depth='basic'`
- Filter your DataFrame before processing
- Use `search_per_field=False` when possible
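For the filtering tip, one common pattern is to search only the rows still missing data. A sketch with pandas (the `sector` column and its meaning here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Microsoft", "Stripe", "DoorDash"],
    "sector": ["technology", None, None],  # already-enriched rows have a value
})

# Only send rows that still need enrichment to the (paid) search step
to_process = df[df["sector"].isna()]
print(len(to_process))  # 2
```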
Rate Limits and Parallel Processing¶
HTTP 429 errors
Using parallel_requests with web search can easily exceed the search provider's rate limits. Searches then fail silently and return incomplete data.
Provider Limits¶
| Provider | Approximate rate limit |
|---|---|
| Tavily | ~100 req/min |
| Exa | ~300 req/min |
If you need higher throughput, consider search_provider="exa".
How Queries are Counted¶
| Configuration | Queries per row |
|---|---|
| `search_per_field=False` | 1 per row |
| `search_per_field=True` | 1 per field, per row |
With parallel_requests=20 and search_per_field=True on a 4-field model, you can fire ~80 concurrent queries — well above either provider's limit.
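The worst-case concurrency is simply the number of parallel rows times the searches per row; a quick sketch of that arithmetic (`concurrent_queries` is an illustrative helper, not library code):

```python
def concurrent_queries(parallel_requests, n_fields=1, per_field=False):
    """Worst-case number of search queries in flight at once."""
    return parallel_requests * (n_fields if per_field else 1)

print(concurrent_queries(20, n_fields=4, per_field=True))  # 80
```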
Recommended Settings¶
Tavily (default):
| Scenario | `parallel_requests` | `rate_limit_delay` |
|---|---|---|
| `search_per_field=False` | 5–10 | 0.5s |
| `search_per_field=True` (2–3 fields) | 3–5 | 0.5s |
| `search_per_field=True` (4+ fields) | 2–3 | 1.0s |
Exa:
| Scenario | `parallel_requests` | `rate_limit_delay` |
|---|---|---|
| `search_per_field=False` | 10–15 | 0.3s |
| `search_per_field=True` (2–3 fields) | 5–8 | 0.3s |
| `search_per_field=True` (4+ fields) | 3–5 | 0.5s |
```python
# Safe settings for Tavily with multiple fields
result = dataframeit(
    df, Model, PROMPT,
    use_search=True, search_per_field=True,
    parallel_requests=3, rate_limit_delay=0.5,
)

# Higher throughput with Exa
result = dataframeit(
    df, Model, PROMPT,
    use_search=True, search_provider="exa",
    search_per_field=True,
    parallel_requests=5, rate_limit_delay=0.3,
)
```
Automatic Warning¶
DataFrameIt emits a UserWarning when the configuration looks risky (high concurrent queries or estimated rate close to the provider limit), with recommended parallel_requests and rate_limit_delay values to avoid HTTP 429. The warning also fires on sequential runs when search_per_field=True produces many queries (>100 total).
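The check can be approximated like this (a hypothetical re-implementation of the heuristic for illustration; thresholds and function name are assumptions, not dataframeit's actual code):

```python
import warnings

def check_search_config(parallel_requests, n_fields, per_field,
                        provider_limit_per_min=100, n_rows=1):
    """Warn when a search configuration risks HTTP 429:
    too many concurrent queries, or a large sequential run."""
    concurrent = parallel_requests * (n_fields if per_field else 1)
    total = n_rows * (n_fields if per_field else 1)
    if concurrent > provider_limit_per_min // 2 or (parallel_requests == 1 and total > 100):
        warnings.warn(
            f"~{concurrent} concurrent search queries may exceed the provider "
            "rate limit; lower parallel_requests or raise rate_limit_delay.",
            UserWarning,
        )

check_search_config(parallel_requests=20, n_fields=4, per_field=True)  # warns
```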