Web Search¶
Enrich your data with web search using Tavily.
Overview¶
DataFrameIt can search the web for information to complement the analysis of each text. This is useful when you need additional context not present in the original text.
Setup¶
1. Install Dependency¶
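The install command was not preserved on this page. Assuming the package is published on PyPI under the name used in the imports below (verify the exact package name and extras in the project README):

```shell
# Package name assumed from `from dataframeit import dataframeit` below;
# tavily-python is Tavily's official Python client.
pip install dataframeit tavily-python
```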
2. Configure API Key¶
Get your key at: Tavily
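Tavily clients conventionally read the key from the `TAVILY_API_KEY` environment variable; assuming dataframeit follows that convention:

```shell
# Replace with your real key from the Tavily dashboard
export TAVILY_API_KEY="tvly-your-key-here"
```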
Basic Usage¶
```python
from pydantic import BaseModel, Field
from typing import Literal
import pandas as pd
from dataframeit import dataframeit

class CompanyInfo(BaseModel):
    sector: Literal['technology', 'health', 'finance', 'retail', 'other']
    description: str = Field(description="Brief company description")
    founded: str = Field(description="Year founded, if found")

# Data with company names
df = pd.DataFrame({
    'text': ['Microsoft', 'Stripe', 'DoorDash']
})

PROMPT = """
Based on available information and web search,
extract information about the mentioned company.
"""

# Enable web search with use_search=True
result = dataframeit(
    df,
    CompanyInfo,
    PROMPT,
    text_column='text',
    use_search=True,  # Enable web search
    max_results=5     # Number of results per search
)
```
Search Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_search` | bool | `False` | Enable web search via Tavily |
| `search_per_field` | bool | `False` | Execute a separate search for each model field |
| `max_results` | int | `5` | Results per search (1–20) |
| `search_depth` | str | `'basic'` | `'basic'` (1 credit) or `'advanced'` (2 credits) |
| `save_trace` | bool/str | `None` | Save agent trace: `True`/`"full"` or `"minimal"` |
Examples¶
Basic Search¶
The simplest configuration is `use_search=True` with default parameters, as shown in the Basic Usage example above.
Search per Field¶
When the model has many fields, it can be useful to run a separate search for each one:
```python
result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True,
    search_per_field=True  # One search per model field
)
```
Deep Search¶
```python
# More detailed search (slower, more expensive)
result = dataframeit(
    df, Model, PROMPT,
    text_column='text',
    use_search=True,
    search_depth='advanced',
    max_results=10
)
```
Debug: Save Agent Trace (v0.5.3+)¶
To debug and audit agent reasoning, use the save_trace parameter.
Parameters¶
| Value | Description |
|---|---|
| `False` / `None` | Disabled (default) |
| `True` / `"full"` | Complete trace with message content |
| `"minimal"` | Only queries and counts, without search result content |
Generated Columns¶
- Single agent: `_trace`
- Per-field: `_trace_{field_name}` for each field
Trace Structure¶
```json
{
  "messages": [
    {"type": "human", "content": "Analyze the medication..."},
    {"type": "ai", "content": "", "tool_calls": [...]},
    {"type": "tool", "content": "[search results]", "tool_call_id": "..."}
  ],
  "search_queries": ["query1", "query2"],
  "total_tool_calls": 2,
  "duration_seconds": 3.45,
  "model": "gpt-4o-mini"
}
```
Example: Full Trace¶
```python
import json

result = dataframeit(
    df,
    MedicationInfo,
    PROMPT,
    use_search=True,
    save_trace=True  # or "full"
)

# Access trace from first row
trace = json.loads(result['_trace'].iloc[0])
print(f"Queries performed: {trace['search_queries']}")
print(f"Duration: {trace['duration_seconds']}s")
print(f"Model: {trace['model']}")
```
Example: Minimal Trace¶
For audits where only the search queries matter:
```python
result = dataframeit(
    df, Model, PROMPT,
    use_search=True,
    save_trace="minimal"  # Excludes search result content
)
```
Example: Per-Field Trace¶
```python
result = dataframeit(
    df,
    MedicationInfo,
    PROMPT,
    use_search=True,
    search_per_field=True,
    save_trace="full"
)

# Each field has its own trace
trace_ingredient = json.loads(result['_trace_active_ingredient'].iloc[0])
trace_indication = json.loads(result['_trace_indication'].iloc[0])
```
Search Groups (v0.5.3+)¶
When multiple fields need the same search context, you can group them to reduce redundant API calls.
Motivation¶
Without groups, if you have 6 fields with search_per_field=True, 6 searches are made per row. With groups, related fields share a single search.
Example:
- Fields fda_status, ema_approval, clinical_trials are all about regulation
- Without groups: 3 separate searches (redundant)
- With groups: 1 shared search (efficient)
Group Parameters¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `fields` | list | Yes | List of fields belonging to the group |
| `prompt` | str | No | Custom prompt for the group; use `{query}` for the text |
| `max_results` | int | No | Results override (1–20) |
| `search_depth` | str | No | Override: `"basic"` or `"advanced"` |
Basic Example¶
```python
from pydantic import BaseModel, Field

class DrugRegulatory(BaseModel):
    # Group "regulatory" fields (1 shared search)
    fda_status: str = Field(description="FDA approval status")
    ema_approval: str = Field(description="EMA approval status")
    clinical_trials: str = Field(description="Ongoing clinical trials")

    # Isolated fields (1 search each)
    name: str = Field(description="Drug name")
    manufacturer: str = Field(description="Manufacturer")

result = dataframeit(
    df,
    DrugRegulatory,
    "Research the drug: {text}",
    text_column='text',
    use_search=True,
    search_per_field=True,
    search_groups={
        "regulatory": {
            "fields": ["fda_status", "ema_approval", "clinical_trials"],
            "prompt": "Search regulatory status (FDA, EMA, clinical trials) for: {query}",
            "search_depth": "advanced",
        }
    }
)
```
Result:
- Before: 5 searches (1 per field)
- After: 3 searches (1 for the group + 2 isolated)
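The count above follows from a simple rule: one search per group plus one per field outside any group. A small sketch of that arithmetic (`searches_per_row` is a hypothetical helper for illustration, not part of the dataframeit API):

```python
def searches_per_row(model_fields, search_groups):
    """Per-row search count when search_per_field=True:
    one search per group plus one per ungrouped field."""
    grouped = {f for g in search_groups.values() for f in g["fields"]}
    isolated = [f for f in model_fields if f not in grouped]
    return len(search_groups) + len(isolated)

fields = ["fda_status", "ema_approval", "clinical_trials", "name", "manufacturer"]
groups = {"regulatory": {"fields": ["fda_status", "ema_approval", "clinical_trials"]}}

print(searches_per_row(fields, {}))      # 5 (no groups: one per field)
print(searches_per_row(fields, groups))  # 3 (1 group + 2 isolated)
```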
Multiple Groups¶
```python
search_groups={
    "regulatory": {
        "fields": ["fda_status", "ema_approval"],
        "prompt": "Search regulatory status: {query}",
    },
    "clinical": {
        "fields": ["efficacy", "safety"],
        "prompt": "Search clinical studies about: {query}",
        "search_depth": "advanced",
    }
}
```
Traces with Groups¶
With save_trace=True, traces are organized by group:
```python
result = dataframeit(
    df, Model, PROMPT,
    use_search=True,
    search_per_field=True,
    search_groups={"regulatory": {"fields": ["fda_status", "ema_approval"]}},
    save_trace=True
)

# Group trace
trace_regulatory = json.loads(result['_trace_regulatory'].iloc[0])

# Isolated field traces
trace_name = json.loads(result['_trace_name'].iloc[0])
```
Validation Rules¶
- Requires `use_search=True` and `search_per_field=True`
- Fields must exist in the Pydantic model
- Fields cannot be in multiple groups
- Fields in groups cannot have `json_schema_extra` search configuration: choose between per-field or group configuration, not both
Use Case: Fact Checking¶
```python
from pydantic import BaseModel, Field
from typing import Literal, List

class FactCheck(BaseModel):
    claim: str = Field(description="The original claim")
    verdict: Literal['true', 'false', 'partially_true', 'inconclusive']
    sources: List[str] = Field(description="Sources supporting the verdict")
    explanation: str = Field(description="Explanation of the verdict")

PROMPT = """
Verify the truthfulness of the claim using web search information.
Cite the sources found.
"""

result = dataframeit(
    df_claims,
    FactCheck,
    PROMPT,
    text_column='text',
    use_search=True,
    max_results=5,
    search_depth='advanced'
)
```
Use Case: Lead Enrichment¶
```python
from pydantic import BaseModel, Field
from typing import Literal, List

class EnrichedLead(BaseModel):
    company: str
    website: str = Field(description="Official website")
    linkedin: str = Field(description="LinkedIn URL")
    size: Literal['startup', 'sme', 'enterprise']
    technologies: List[str] = Field(description="Technologies used")

result = dataframeit(
    df_leads,
    EnrichedLead,
    "Research information about the company.",
    text_column='text',
    use_search=True,
    max_results=3
)
```
Costs and Limits¶
Watch costs
Each DataFrame row makes a web search. For large datasets, this can generate significant costs on the Tavily API.
- Free tier: 1000 searches/month
- Basic search: ~$0.01 per search
- Advanced search: ~$0.02 per search
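Before a large run, it can help to estimate the bill up front from the approximate prices above. A minimal sketch (`estimate_cost` is an illustrative helper, not a dataframeit function, and real Tavily pricing may differ):

```python
def estimate_cost(n_rows, n_fields=1, per_field=False, depth="basic"):
    """Rough Tavily cost estimate: ~$0.01 per basic search,
    ~$0.02 per advanced search (approximate prices from above)."""
    price = 0.02 if depth == "advanced" else 0.01
    searches = n_rows * (n_fields if per_field else 1)
    return searches, searches * price

searches, cost = estimate_cost(n_rows=10_000, n_fields=4, per_field=True)
print(f"{searches} searches, ~${cost:.2f}")  # 40000 searches, ~$400.00
```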
Tips to Save¶
- Use `max_results=3` to `5` (enough for most cases)
- Prefer `search_depth='basic'`
- Filter your DataFrame before processing
- Use `search_per_field=False` when possible
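For the filtering tip, one common pattern is to search only the rows still missing data. A sketch with pandas (the `sector` column and its meaning here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Microsoft", "Stripe", "DoorDash"],
    "sector": ["technology", None, None],  # already-enriched rows have a value
})

# Only send rows that still need enrichment to the (paid) search step
to_process = df[df["sector"].isna()]
print(len(to_process))  # 2
```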
Rate Limits and Parallel Processing¶
HTTP 429 errors
Using parallel_requests with web search can easily exceed the search provider's rate limits. Searches then fail silently and return incomplete data.
Provider Limits¶
| Provider | Approximate rate limit |
|---|---|
| Tavily | ~100 req/min |
| Exa | ~300 req/min |
If you need higher throughput, consider search_provider="exa".
How Queries are Counted¶
| Configuration | Queries per row |
|---|---|
| `search_per_field=False` | 1 per row |
| `search_per_field=True` | 1 per field, per row |
With parallel_requests=20 and search_per_field=True on a 4-field model, you can fire ~80 concurrent queries — well above either provider's limit.
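The worst-case concurrency is simply the number of parallel rows times the searches per row; a quick sketch of that arithmetic (`concurrent_queries` is an illustrative helper, not library code):

```python
def concurrent_queries(parallel_requests, n_fields=1, per_field=False):
    """Worst-case number of search queries in flight at once."""
    return parallel_requests * (n_fields if per_field else 1)

print(concurrent_queries(20, n_fields=4, per_field=True))  # 80
```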
Recommended Settings¶
Tavily (default):
| Scenario | `parallel_requests` | `rate_limit_delay` |
|---|---|---|
| `search_per_field=False` | 5–10 | 0.5s |
| `search_per_field=True` (2–3 fields) | 3–5 | 0.5s |
| `search_per_field=True` (4+ fields) | 2–3 | 1.0s |
Exa:
| Scenario | `parallel_requests` | `rate_limit_delay` |
|---|---|---|
| `search_per_field=False` | 10–15 | 0.3s |
| `search_per_field=True` (2–3 fields) | 5–8 | 0.3s |
| `search_per_field=True` (4+ fields) | 3–5 | 0.5s |
```python
# Safe settings for Tavily with multiple fields
result = dataframeit(
    df, Model, PROMPT,
    use_search=True, search_per_field=True,
    parallel_requests=3, rate_limit_delay=0.5,
)

# Higher throughput with Exa
result = dataframeit(
    df, Model, PROMPT,
    use_search=True, search_provider="exa",
    search_per_field=True,
    parallel_requests=5, rate_limit_delay=0.3,
)
```
Automatic Warning¶
DataFrameIt emits a UserWarning when the configuration looks risky (high concurrent queries or estimated rate close to the provider limit), with recommended parallel_requests and rate_limit_delay values to avoid HTTP 429. The warning also fires on sequential runs when search_per_field=True produces many queries (>100 total).
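The check can be approximated like this (a hypothetical re-implementation of the heuristic for illustration; thresholds and function name are assumptions, not dataframeit's actual code):

```python
import warnings

def check_search_config(parallel_requests, n_fields, per_field,
                        provider_limit_per_min=100, n_rows=1):
    """Warn when a search configuration risks HTTP 429:
    too many concurrent queries, or a large sequential run."""
    concurrent = parallel_requests * (n_fields if per_field else 1)
    total = n_rows * (n_fields if per_field else 1)
    if concurrent > provider_limit_per_min // 2 or (parallel_requests == 1 and total > 100):
        warnings.warn(
            f"~{concurrent} concurrent search queries may exceed the provider "
            "rate limit; lower parallel_requests or raise rate_limit_delay.",
            UserWarning,
        )

check_search_config(parallel_requests=20, n_fields=4, per_field=True)  # warns
```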