Web Search

Enrich your data with web search using Tavily.

Overview

DataFrameIt can search the web for information to complement the analysis of each text. This is useful when you need additional context not present in the original text.

Setup

1. Install Dependency

pip install dataframeit[search]
# or
pip install langchain-tavily

2. Configure API Key

export TAVILY_API_KEY="your-tavily-key"

Get your key at Tavily (https://tavily.com).
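
If you prefer to configure the key from Python (for example in a notebook) instead of the shell, you can set the same environment variable with the standard os module before running DataFrameIt:

import os

# Set the Tavily key for the current process only
# (equivalent to the shell export above)
os.environ["TAVILY_API_KEY"] = "your-tavily-key"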

Basic Usage

from pydantic import BaseModel, Field
from typing import Literal
import pandas as pd
from dataframeit import dataframeit

class CompanyInfo(BaseModel):
    sector: Literal['technology', 'health', 'finance', 'retail', 'other']
    description: str = Field(description="Brief company description")
    founded: str = Field(description="Year founded, if found")

# Data with company names
df = pd.DataFrame({
    'text': ['Microsoft', 'Stripe', 'DoorDash']
})

PROMPT = """
Based on available information and web search,
extract information about the mentioned company.
"""

# Enable web search with use_search=True
result = dataframeit(
    df,
    CompanyInfo,
    PROMPT,
    text_column='text',
    use_search=True,      # Enable web search
    max_results=5         # Number of results per search
)
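
The call returns a DataFrame. Assuming each model field is written back as a column alongside the original text, you can inspect the enriched rows directly:

# Quick check of the enrichment (assumes each CompanyInfo field
# becomes a column named after the field)
print(result[['text', 'sector', 'description', 'founded']])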

Search Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| use_search | bool | False | Enable web search via Tavily |
| search_per_field | bool | False | Execute a separate search for each model field |
| max_results | int | 5 | Results per search (1-20) |
| search_depth | str | 'basic' | 'basic' (1 credit) or 'advanced' (2 credits) |
| save_trace | bool/str | None | Save agent trace: True/"full" or "minimal" |

Examples

result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True
)

Search per Field

When the model has many fields, it can be useful to run a separate search for each one:

result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True,
    search_per_field=True  # One search per model field
)

# More detailed search (slower, more expensive)
result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True,
    search_depth='advanced',
    max_results=10
)

Debug: Save Agent Trace (v0.5.3+)

To debug and audit agent reasoning, use the save_trace parameter.

Parameters

| Value | Description |
|---|---|
| False / None | Disabled (default) |
| True / "full" | Complete trace with message content |
| "minimal" | Only queries and counts, without search result content |

Generated Columns

  • Single agent: _trace
  • Per-field: _trace_{field_name} for each field
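
To see which trace columns a run actually produced, you can filter the result's columns by the _trace prefix; a small sketch:

# List every trace column generated on this run
# (assumes a previous run with save_trace enabled)
trace_columns = [col for col in result.columns if col.startswith('_trace')]
print(trace_columns)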

Trace Structure

{
    "messages": [
        {"type": "human", "content": "Analyze the medication..."},
        {"type": "ai", "content": "", "tool_calls": [...]},
        {"type": "tool", "content": "[search results]", "tool_call_id": "..."}
    ],
    "search_queries": ["query1", "query2"],
    "total_tool_calls": 2,
    "duration_seconds": 3.45,
    "model": "gpt-4o-mini"
}
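
Given this structure, you can walk the messages list to see what the agent did step by step; a minimal sketch based only on the fields shown above:

import json

# Assumes a previous run with save_trace=True
trace = json.loads(result['_trace'].iloc[0])
for message in trace['messages']:
    # 'tool' messages carry the raw search results returned to the agent
    if message['type'] == 'tool':
        print(message['tool_call_id'], message['content'][:200])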

Example: Full Trace

import json

result = dataframeit(
    df,
    MedicationInfo,
    PROMPT,
    use_search=True,
    save_trace=True  # or "full"
)

# Access trace from first row
trace = json.loads(result['_trace'].iloc[0])
print(f"Queries performed: {trace['search_queries']}")
print(f"Duration: {trace['duration_seconds']}s")
print(f"Model: {trace['model']}")

Example: Minimal Trace

For audits where only the search queries matter:

result = dataframeit(
    df, Model, PROMPT,
    use_search=True,
    save_trace="minimal"  # Excludes search result content
)

Example: Per-Field Trace

result = dataframeit(
    df,
    MedicationInfo,
    PROMPT,
    use_search=True,
    search_per_field=True,
    save_trace="full"
)

# Each field has its own trace
trace_ingredient = json.loads(result['_trace_active_ingredient'].iloc[0])
trace_indication = json.loads(result['_trace_indication'].iloc[0])

Search Groups (v0.5.3+)

When multiple fields need the same search context, you can group them to reduce redundant API calls.

Motivation

Without groups, if you have 6 fields with search_per_field=True, 6 searches are made per row. With groups, related fields share a single search.

Example:

  • Fields fda_status, ema_approval, and clinical_trials are all about regulation
  • Without groups: 3 separate searches (redundant)
  • With groups: 1 shared search (efficient)

Group Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| fields | list | Yes | List of fields belonging to the group |
| prompt | str | No | Custom prompt for the group. Use {query} for the text |
| max_results | int | No | Results override (1-20) |
| search_depth | str | No | Override: "basic" or "advanced" |

Basic Example

from pydantic import BaseModel, Field

class DrugRegulatory(BaseModel):
    # Group "regulatory" fields (1 shared search)
    fda_status: str = Field(description="FDA approval status")
    ema_approval: str = Field(description="EMA approval status")
    clinical_trials: str = Field(description="Ongoing clinical trials")

    # Isolated fields (1 search each)
    name: str = Field(description="Drug name")
    manufacturer: str = Field(description="Manufacturer")

result = dataframeit(
    df,
    DrugRegulatory,
    "Research the drug: {texto}",
    text_column='text',
    use_search=True,
    search_per_field=True,
    search_groups={
        "regulatory": {
            "fields": ["fda_status", "ema_approval", "clinical_trials"],
            "prompt": "Search regulatory status (FDA, EMA, clinical trials) for: {query}",
            "search_depth": "advanced",
        }
    }
)

Result:

  • Before: 5 searches (1 per field)
  • After: 3 searches (1 for the group + 2 isolated)

Multiple Groups

search_groups={
    "regulatory": {
        "fields": ["fda_status", "ema_approval"],
        "prompt": "Search regulatory status: {query}",
    },
    "clinical": {
        "fields": ["efficacy", "safety"],
        "prompt": "Search clinical studies about: {query}",
        "search_depth": "advanced",
    }
}

Traces with Groups

With save_trace=True, traces are organized by group:

result = dataframeit(
    df, Model, PROMPT,
    use_search=True,
    search_per_field=True,
    search_groups={"regulatory": {"fields": ["fda_status", "ema_approval"]}},
    save_trace=True
)

# Group trace
trace_regulatory = json.loads(result['_trace_regulatory'].iloc[0])

# Isolated field traces
trace_name = json.loads(result['_trace_name'].iloc[0])

Validation Rules

  1. Requires use_search=True and search_per_field=True
  2. Fields must exist in the Pydantic model
  3. Fields cannot be in multiple groups
  4. Fields in groups cannot have json_schema_extra for search - choose between per-field or group configuration, not both
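
If you want to catch configuration mistakes before launching a long run, a rough pre-flight check covering rules 2 and 3 can be written against the Pydantic model. The validate_groups helper below is illustrative, not part of DataFrameIt, and assumes Pydantic v2:

def validate_groups(model_cls, search_groups):
    """Illustrative pre-flight check for rules 2 and 3 above."""
    seen = set()
    for group_name, group in search_groups.items():
        for field in group["fields"]:
            # Rule 2: the field must exist in the Pydantic model
            if field not in model_cls.model_fields:
                raise ValueError(f"Unknown field '{field}' in group '{group_name}'")
            # Rule 3: a field cannot belong to more than one group
            if field in seen:
                raise ValueError(f"Field '{field}' appears in multiple groups")
            seen.add(field)

validate_groups(DrugRegulatory, {
    "regulatory": {"fields": ["fda_status", "ema_approval", "clinical_trials"]}
})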

Use Case: Fact Checking

from pydantic import BaseModel, Field
from typing import Literal, List

class FactCheck(BaseModel):
    claim: str = Field(description="The original claim")
    verdict: Literal['true', 'false', 'partially_true', 'inconclusive']
    sources: List[str] = Field(description="Sources supporting the verdict")
    explanation: str = Field(description="Explanation of the verdict")

PROMPT = """
Verify the truthfulness of the claim using web search information.
Cite the sources found.
"""

result = dataframeit(
    df_claims,
    FactCheck,
    PROMPT,
    text_column='text',
    use_search=True,
    max_results=5,
    search_depth='advanced'
)

Use Case: Lead Enrichment

class EnrichedLead(BaseModel):
    company: str
    website: str = Field(description="Official website")
    linkedin: str = Field(description="LinkedIn URL")
    size: Literal['startup', 'sme', 'enterprise']
    technologies: List[str] = Field(description="Technologies used")

result = dataframeit(
    df_leads,
    EnrichedLead,
    "Research information about the company.",
    text_column='text',
    use_search=True,
    max_results=3
)

Costs and Limits

Watch costs

Each DataFrame row triggers at least one web search (and one per field when search_per_field=True). For large datasets, this can generate significant costs on the Tavily API.

  • Free tier: 1000 searches/month
  • Basic search: ~$0.01 per search
  • Advanced search: ~$0.02 per search

Tips to Save

  1. Use max_results=3 to 5 (enough for most cases)
  2. Prefer search_depth='basic'
  3. Filter your DataFrame before processing
  4. Use search_per_field=False when possible
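
Before a large run, a rough back-of-envelope estimate of search volume and cost can help. The helper below is illustrative, not part of the library, and uses the approximate prices listed above:

def estimate_search_cost(n_rows, n_fields=1, per_field=False, depth='basic'):
    """Rough estimate based on the approximate Tavily prices above."""
    searches = n_rows * (n_fields if per_field else 1)
    price_per_search = 0.02 if depth == 'advanced' else 0.01
    return searches, searches * price_per_search

searches, cost = estimate_search_cost(5000, n_fields=4, per_field=True)
print(f"{searches} searches, ~${cost:.2f}")  # 20000 searches, ~$200.00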

Rate Limits and Parallel Processing

HTTP 429 errors

Using parallel_requests with web search can easily exceed the search provider's rate limits. Searches then fail silently and return incomplete data.

Provider Limits

| Provider | Approximate rate limit |
|---|---|
| Tavily | ~100 req/min |
| Exa | ~300 req/min |

If you need higher throughput, consider search_provider="exa".

How Queries are Counted

| Configuration | Queries per row |
|---|---|
| search_per_field=False | 1 |
| search_per_field=True | 1 per field |

With parallel_requests=20 and search_per_field=True on a 4-field model, you can fire ~80 concurrent queries — well above either provider's limit.

Tavily (default):

| Scenario | parallel_requests | rate_limit_delay |
|---|---|---|
| search_per_field=False | 5–10 | 0.5s |
| search_per_field=True (2–3 fields) | 3–5 | 0.5s |
| search_per_field=True (4+ fields) | 2–3 | 1.0s |

Exa:

| Scenario | parallel_requests | rate_limit_delay |
|---|---|---|
| search_per_field=False | 10–15 | 0.3s |
| search_per_field=True (2–3 fields) | 5–8 | 0.3s |
| search_per_field=True (4+ fields) | 3–5 | 0.5s |

# Safe settings for Tavily with multiple fields
result = dataframeit(
    df, Model, PROMPT,
    use_search=True, search_per_field=True,
    parallel_requests=3, rate_limit_delay=0.5,
)

# Higher throughput with Exa
result = dataframeit(
    df, Model, PROMPT,
    use_search=True, search_provider="exa",
    search_per_field=True,
    parallel_requests=5, rate_limit_delay=0.3,
)

Automatic Warning

DataFrameIt emits a UserWarning when the configuration looks risky (high concurrent queries or estimated rate close to the provider limit), with recommended parallel_requests and rate_limit_delay values to avoid HTTP 429. The warning also fires on sequential runs when search_per_field=True produces many queries (>100 total).
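
Since it is a standard Python UserWarning, you can promote it to an error while tuning your settings, or silence it once they are stable; a minimal sketch (assuming the warning is raised from within the dataframeit package):

import warnings

# Fail fast while tuning: turn the warning into an exception
warnings.filterwarnings("error", category=UserWarning, module="dataframeit")

# Or silence it once parallel_requests / rate_limit_delay are settled
# warnings.filterwarnings("ignore", category=UserWarning, module="dataframeit")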