Python

The Python SDK wraps the Spidra API so you’re not writing raw HTTP calls and polling loops yourself. It handles job submission, status polling, retry logic, and error mapping to typed exceptions.

Installation

pip install spidra

Requires Python 3.9 or higher.

Get your API key from app.spidra.io under Settings > API Keys. Store it as an environment variable. Never hardcode it.
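One way to follow that advice is to read the key from the environment at startup and fail fast when it is missing. The sketch below uses only the standard library. The variable name SPIDRA_API_KEY and the helper load_api_key are assumptions made for illustration; nothing here implies the SDK reads that variable on its own.

```python
import os


def load_api_key(env=os.environ, name="SPIDRA_API_KEY"):
    """Return the API key stored in the environment, or raise if it is unset.

    SPIDRA_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before starting")
    return key


# Usage, assuming the key was exported in your shell or secrets manager:
#   spidra = Spidra(api_key=load_api_key())
```

Failing at startup keeps a missing key from surfacing later as a 401 in the middle of a job.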

Getting started

from spidra import Spidra

spidra = Spidra(api_key="spd_YOUR_API_KEY")

Name the instance spidra, client, or whatever fits your codebase — the method names stay the same.

If you’re inside an existing async context (FastAPI, asyncio, Jupyter notebook), use AsyncSpidra instead and await the calls. The method signatures are identical.

Scraping

The scraper accepts up to three URLs per request and processes them in parallel. You can pass a URL string directly, or a ScrapeParams object for full control. The simplest call:

from spidra import Spidra

spidra = Spidra(api_key="spd_YOUR_API_KEY")

result = spidra.scrape(
    "https://example.com/pricing",
    prompt="Extract all pricing plans with name, price, and included features",
    output="json",
)

print(result.content)
# {"plans": [{"name": "Starter", "price": "$9/mo", ...}]}

If you’d rather fire and move on, start_scrape() returns a job ID immediately. You can then call get_scrape() whenever you’re ready to check:

queued = spidra.start_scrape(
    "https://example.com",
    prompt="Extract the main headline",
)

# Later...
status = spidra.get_scrape(queued.job_id)

if status.status == "completed":
    print(status.result.content)

Job statuses move through: queued → waiting → active → completed (or failed).
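If you poll by hand, the status sequence above suggests a loop that stops on completed or failed and gives up after a deadline. This is a minimal sketch rather than the SDK's own polling logic: wait_for_job is a hypothetical helper, and it accepts any callable that returns an object with a .status attribute, such as spidra.get_scrape.

```python
import time

# Terminal statuses taken from the sequence described above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_status, job_id, interval=3.0, timeout=60.0, sleep=time.sleep):
    """Call fetch_status(job_id) until the job reaches a terminal status.

    Returns the last status object seen, or raises TimeoutError once
    `timeout` seconds have passed without a terminal status.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.status} after {timeout}s")
        sleep(interval)


# Usage, assuming `queued` came from spidra.start_scrape(...):
#   job = wait_for_job(spidra.get_scrape, queued.job_id, interval=3, timeout=300)
```

The sleep parameter exists only so the loop can be exercised in tests without real waiting.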

Scrape parameters

Parameter	Type	Description
`urls`	list	Up to 3 `ScrapeUrl` objects. Each takes a `url` and optional `actions`
`prompt`	str	What to extract, written in plain English
`output`	str	`"markdown"` (default) or `"json"`
`schema`	dict	JSON Schema that forces a specific output shape
`use_proxy`	bool	Route through a residential proxy
`proxy_country`	str	Two-letter country code such as `"us"`, `"de"`, or `"jp"`, or `"global"` / `"eu"` for regional routing
`extract_content_only`	bool	Strip nav, ads, and boilerplate before the AI sees the page
`screenshot`	bool	Capture a viewport screenshot
`full_page_screenshot`	bool	Capture a full-page scrolled screenshot
`cookies`	str	Raw `Cookie` header string for pages behind a login

Enforcing an exact output shape

Without a schema the AI extracts whatever it finds. With a schema, missing fields come back as None rather than guessed values, which matters when the output feeds a database or a typed pipeline downstream.

Use the Spidra JSON Schema Generator to build and preview your schema visually, then paste it into the schema argument:

from spidra import ScrapeParams, ScrapeUrl

result = spidra.scrape(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title":      {"type": "string"},
            "company":    {"type": "string"},
            "remote":     {"type": ["boolean", "null"]},
            "salary_min": {"type": ["number", "null"]},
            "skills":     {"type": "array", "items": {"type": "string"}},
        },
    },
))

Define every field you want extracted. An untyped object with no properties (or an array of them) gives the AI nothing to fill in, so those members come back empty.
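A quick local check can catch that mistake before you spend credits. The sketch below walks a JSON Schema dict and reports every object node that declares no properties, including objects nested in arrays. find_untyped_objects is a hypothetical helper written for this page, not part of the SDK, and it covers only the type, properties, and items keywords.

```python
def find_untyped_objects(schema, path="$"):
    """Return the paths of object nodes that declare no properties."""
    problems = []
    types = schema.get("type", [])
    if isinstance(types, str):
        types = [types]  # normalise "object" and ["object", "null"] alike

    if "object" in types:
        props = schema.get("properties") or {}
        if not props:
            problems.append(path)
        for name, child in props.items():
            problems.extend(find_untyped_objects(child, f"{path}.{name}"))

    if "array" in types and isinstance(schema.get("items"), dict):
        problems.extend(find_untyped_objects(schema["items"], f"{path}[]"))

    return problems


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "meta": {"type": "object"},                             # no properties
        "tags": {"type": "array", "items": {"type": "object"}},  # array of untyped objects
    },
}
print(find_untyped_objects(schema))  # ['$.meta', '$.tags[]']
```

An empty list means every object in the schema names its fields.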

Enforcing shape with Pydantic

If you already model your data with Pydantic, skip the JSON Schema entirely — pass the model itself (class or instance) and the SDK converts it for you:

from typing import Optional

from pydantic import BaseModel

class JobListing(BaseModel):
    title: str
    company: str
    remote: Optional[bool] = None  # Optional[...] keeps the model valid on Python 3.9
    skills: list[str] = []

result = spidra.scrape(
    "https://jobs.example.com/senior-engineer",
    prompt="Extract the job listing details",
    output="json",
    schema=JobListing,
)

listing = JobListing.model_validate(result.content)  # validated, typed access

The same works on batch_scrape() and crawl() (applied per page). Pydantic stays optional — install it with pip install spidra[pydantic] only if you use this. Both v2 and v1 models are supported.

Scraping geo-restricted content

Some sites serve different prices or content depending on where you’re browsing from. Set use_proxy=True and a proxy_country code to route through a residential IP in that country:

result = spidra.scrape(
    "https://www.amazon.de/gp/bestsellers",
    prompt="List the top 10 products with name and price",
    use_proxy=True,
    proxy_country="de",
)

Supported country codes include us, gb, de, fr, jp, au, ca, br, in, nl, and 40+ more. Use "global" or "eu" for regional routing without pinning to a specific country.

Scraping pages behind a login

If the page requires a session, pass your cookies as a raw header string. The easiest way to get one is to log in through your browser, open the developer tools, and copy the Cookie header from any authenticated request:

result = spidra.scrape(
    "https://app.example.com/dashboard",
    prompt="Extract the monthly revenue and active user count",
    cookies="session=abc123; auth_token=xyz789",
)
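If your cookies live in a dict rather than a copied header (for example after exporting them from a browser profile), the raw string is the name=value pairs joined with "; ". A minimal sketch, where cookie_header is a hypothetical helper and the values are assumed to be header-safe already:

```python
def cookie_header(cookies):
    """Join a name -> value mapping into a raw Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


header = cookie_header({"session": "abc123", "auth_token": "xyz789"})
print(header)  # session=abc123; auth_token=xyz789
```

The result can be passed as the cookies argument shown above.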

Browser actions

Sometimes you need to interact with the page before extraction — dismiss a cookie banner, type into a search box, scroll to load lazy content. Pass an actions list inside the ScrapeUrl and they run in order before the AI sees the page:

from spidra import ScrapeUrl, BrowserAction

result = spidra.scrape(
    ScrapeUrl(
        url="https://example.com/products",
        actions=[
            BrowserAction(type="click", selector="#accept-cookies"),
            BrowserAction(type="wait", duration=1000),
            BrowserAction(type="scroll", to="80%"),
        ],
    ),
    prompt="Extract all product names and prices visible on the page",
)

For selector you can pass a CSS selector or XPath. If you’d rather describe the element in plain English, use value and Spidra will locate it with AI.

Action	What it does
`click`	Click any element — use `selector` for CSS, `value` for plain text
`type`	Type into an input or textarea
`check`	Check a checkbox
`uncheck`	Uncheck a checkbox
`wait`	Pause for `duration` milliseconds
`scroll`	Scroll to a percentage of the page height, e.g. `"80%"`
`forEach`	Loop over every matched element and extract from each one

Controlling how long scrape() waits

By default scrape() polls every 3 seconds and waits until the job finishes, however long that takes. Use poll_interval to change how often it polls and timeout to cap the wait, both in seconds. When the timeout fires, SpidraTimeoutError is raised and the job keeps running server-side, so you can check it later with get_scrape() or cancel it:

result = spidra.scrape(
    "https://example.com",
    prompt="...",
    poll_interval=5,
    timeout=300,
)

Transient hiccups mid-wait — a 502 blip, a dropped connection, a rate limit — don’t kill the wait; the SDK keeps polling unless several happen in a row. The same options work on batch_scrape() and crawl().

Batch scraping

When you have a list of URLs to process, batch is the right tool. You can submit up to 50 URLs in a single request and they all run in parallel.

batch = spidra.batch_scrape(
    ["https://shop.example.com/product/1", "https://shop.example.com/product/2"],
    prompt="Extract product name, price, and whether it is in stock",
    output="json",
)

print(f"{batch.completed_count}/{batch.total_urls} completed")

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)
    else:
        print(f"Failed: {item.url} — {item.error}")
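Because a single batch takes at most 50 URLs, a longer list has to be split before submission. The sketch below shows one way to do that with plain slicing. chunked is a hypothetical helper, and the product URLs are made up for the example.

```python
def chunked(urls, size=50):
    """Yield consecutive slices of urls, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


urls = [f"https://shop.example.com/product/{n}" for n in range(1, 121)]
groups = list(chunked(urls))
print([len(group) for group in groups])  # [50, 50, 20]
```

Each group can then be handed to batch_scrape() or start_batch_scrape() in turn.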

Each item moves through pending → running → completed (or failed). If you don’t want to wait for the whole thing to finish, use start_batch_scrape() and get_batch_scrape() separately:

queued = spidra.start_batch_scrape(
    ["https://example.com/1", "https://example.com/2"],
    prompt="Extract the page title and meta description",
)

# Come back later
result = spidra.get_batch_scrape(queued.batch_id)
print(f"{result.completed_count} of {result.total_urls} done")

Retrying failures and cancelling

# `queued` and `result` come from the start_batch_scrape() / get_batch_scrape() calls above
if result.failed_count > 0:
    spidra.retry_batch_scrape(queued.batch_id)

# Cancel a running batch and refund unprocessed credits
response = spidra.cancel_batch_scrape(batch_id)
print(f"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} credits")

To look through past batches:

page = spidra.list_batch_scrapes(page=1, limit=20)

for job in page.jobs:
    print(job.uuid, job.status, f"{job.completed_count}/{job.total_urls}")

Crawling

Crawling is different from scraping. You give it a starting URL and it discovers and processes pages on its own, following links according to your instructions. Good for indexing a docs site, monitoring a competitor’s blog, or building a structured dataset from an entire section of a site.

job = spidra.crawl(
    "https://competitor.com/blog",
    crawl_instruction="Follow links to blog posts only, skip tag pages and the homepage",
    transform_instruction="Extract the post title, author name, publish date, and a one-sentence summary",
    max_pages=30,
)

for page in job.result:
    print(page.url, page.data)

crawl_instruction tells the crawler which links to follow. transform_instruction tells the AI what to extract from each page. By default the call waits until the crawl finishes — pass timeout=<seconds> to bound the wait (the job keeps running server-side if it fires).

Raw content mode

Omit both transform_instruction and schema to get the raw page content without any AI processing. Each page’s data field contains the plain markdown of that page — no token credits are charged:

job = spidra.crawl(
    "https://docs.example.com",
    crawl_instruction="Crawl all documentation pages",
    max_pages=50,
)

for page in job.result:
    # page.data contains the raw markdown of each page
    print(page.url, page.data[:200])

This is useful when you want to feed the content into your own AI pipeline.

Structured output with schema

When you need every page to return the same fields in the same format, use schema. The AI returns JSON matching it exactly for every page.

Use the Spidra JSON Schema Generator to build and preview your schema visually, then paste it into the schema argument:

from spidra import CrawlParams

job = spidra.crawl(CrawlParams(
    base_url="https://example.com/jobs",
    crawl_instruction="Crawl all job listing pages",
    schema={
        "type": "object",
        "properties": {
            "title":    {"type": "string"},
            "location": {"type": "string"},
            "salary":   {"type": "string"},
            "remote":   {"type": "boolean"},
        },
    },
    max_pages=20,
))

for page in job.result:
    print(page.data)  # {"title": "...", "location": "...", ...}

Scoped crawling with path filters

Use include_paths and exclude_paths to keep crawls focused on the content you actually need. Both accept glob-style patterns:

job = spidra.crawl(CrawlParams(
    base_url="https://example.com",
    crawl_instruction="Crawl all documentation pages",
    transform_instruction="Extract the page title and main content",
    include_paths=["/docs/*"],
    exclude_paths=["/docs/changelog/*", "/docs/legacy/*"],
    max_pages=30,
))
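To get a feel for which URLs a pair of filters would keep, you can approximate the matching locally with the standard library's fnmatch. This is only a rough preview under stated assumptions: Spidra's server-side matcher may treat * and / differently (fnmatch lets * span slashes), and giving exclude patterns priority over include patterns is a choice made for this sketch. path_allowed is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def path_allowed(url, include_paths=None, exclude_paths=None):
    """Return True if the URL's path passes the include and exclude patterns."""
    path = urlsplit(url).path or "/"
    # Assumption: an exclude match wins over any include match.
    if exclude_paths and any(fnmatchcase(path, p) for p in exclude_paths):
        return False
    if include_paths:
        return any(fnmatchcase(path, p) for p in include_paths)
    return True


include = ["/docs/*"]
exclude = ["/docs/changelog/*", "/docs/legacy/*"]
print(path_allowed("https://example.com/docs/intro", include, exclude))         # True
print(path_allowed("https://example.com/docs/changelog/v2", include, exclude))  # False
print(path_allowed("https://example.com/blog/post", include, exclude))          # False
```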

Crawl parameters

Parameter	Type	Default	Description
`base_url`	str	required	Starting URL for the crawl
`crawl_instruction`	str	`"Find all pages on the website"`	Which links to follow, in plain language
`transform_instruction`	str	—	What to extract from each page. Omit for raw markdown mode
`schema`	dict	—	JSON Schema defining the exact output structure per page
`max_pages`	int	`5`	Maximum pages to crawl (1–50)
`max_depth`	int	unlimited	Max link depth from the base URL. `0` = base URL only
`include_paths`	list[str]	—	URL path patterns to include, e.g. `["/blog/*"]`
`exclude_paths`	list[str]	—	URL path patterns to skip, e.g. `["/tag/*"]`
`allow_subdomains`	bool	`False`	Follow links to subdomains of the base domain
`crawl_entire_domain`	bool	`False`	Follow any link on the same root domain
`ignore_query_params`	bool	`False`	Treat URLs differing only by query string as the same page
`webhook_url`	str	—	Receive a POST request for each processed page and on job completion
`use_proxy`	bool	`False`	Route through a residential proxy
`proxy_country`	str	`"global"`	Two-letter country code, or `"global"` / `"eu"` for regional routing. Requires `use_proxy=True`
`cookies`	str	—	Cookie header string for authenticated crawls

Submitting without waiting

Just like scraping, you can fire-and-forget with start_crawl() and poll with get_crawl():

queued = spidra.start_crawl(
    "https://example.com/docs",
    crawl_instruction="Follow all documentation pages",
    transform_instruction="Extract the page title and a short summary",
    max_pages=50,
)

# Poll manually
status = spidra.get_crawl(queued.job_id)
if status.status == "completed":
    for page in status.result:
        print(page.url, page.data)

Cancelling a crawl

Cancel a queued or running job at any time. Pages already processed are kept:

response = spidra.cancel_crawl(job_id)
print(response.status)  # "cancelled"

# Retrieve whatever was processed before cancellation
pages = spidra.crawl_pages(job_id)
for page in pages.pages:
    if page.status == "success":
        print(page.url, page.data)

Downloading the raw HTML and Markdown

Once a crawl completes, crawl_pages() returns signed download URLs for the raw HTML and Markdown of every page. These links expire after one hour:

response = spidra.crawl_pages(job_id)

for page in response.pages:
    print(page.url, page.status)
    # page.html     — signed URL for the raw HTML snapshot
    # page.markdown — signed URL for the Markdown version

Re-extracting with a different prompt

If you crawled a site and want to pull out different information, you don’t have to re-crawl. crawl_extract() runs a new AI pass over the already-crawled content and charges only transformation credits:

queued = spidra.crawl_extract(
    completed_job_id,
    "Extract only product SKUs and prices as structured JSON",
)

result = spidra.get_crawl(queued.job_id)

Browsing your crawl history

response = spidra.crawl_history(page=1, limit=10)
print(f"Total crawl jobs: {response.total}")

stats = spidra.crawl_stats()
print(f"All-time: {stats.total}")

Watching jobs (streaming results)

A 50-page crawl can take a while. Instead of waiting for the whole thing, watch_crawl() yields each page the moment it’s crawled — perfect for writing results to a database as they arrive or updating a progress bar:

queued = spidra.start_crawl(
    "https://competitor.com/blog",
    crawl_instruction="Follow blog post links only",
    transform_instruction="Extract title, author, and publish date",
    max_pages=50,
)

for page in spidra.watch_crawl(queued.job_id):
    print(page.url, page.data)  # fires once per crawled page

Batches work the same way — watch_batch() yields each item as it finishes (completed or failed):

queued = spidra.start_batch_scrape(urls, prompt="Extract product data")

for item in spidra.watch_batch(queued.batch_id):
    print(item.url, item.status, item.result)

On AsyncSpidra these are async generators: same names, consumed with async for. Every page or item is yielded exactly once, including ones that finished before you started watching. Page content is re-fetched only when the crawl actually makes progress, so watching stays cheap. The loop ends when the job completes or is cancelled, and it raises SpidraJobFailedError if the job fails. Breaking out early never cancels the job; use cancel_crawl() for that.
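Writing results to a database as they arrive could look like the sketch below, which stores each yielded page in SQLite and commits per page. store_pages and the table layout are assumptions made for illustration. The function accepts any iterable of objects with .url and .data attributes, which is the shape the watch_crawl() examples above rely on.

```python
import json
import sqlite3


def store_pages(pages, conn):
    """Insert each page into SQLite as soon as it is yielded; return the count."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, data TEXT)")
    count = 0
    for page in pages:
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, data) VALUES (?, ?)",
            (page.url, json.dumps(page.data)),  # data may be a dict or raw markdown
        )
        conn.commit()  # commit per page so progress survives an interruption
        count += 1
    return count


# Usage, assuming `queued` came from spidra.start_crawl(...):
#   conn = sqlite3.connect("crawl.db")
#   store_pages(spidra.watch_crawl(queued.job_id), conn)
```

Since breaking out of the loop does not cancel the job, an interrupted run leaves the crawl running server-side with the already-stored rows intact.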

Logs

Every scrape request your API key makes gets logged automatically. You can filter by status, URL, date range, or where it came from:

response = spidra.scrape_logs(
    status="failed",
    search_term="amazon.com",
    start_date="2024-01-01",
    end_date="2024-12-31",
    page=1,
    limit=20,
)

for log in response.logs:
    print(log.urls[0].get("url"), log.status, log.credits_used)

To fetch the full details of a single log entry including the AI extraction output:

log = spidra.get_scrape_log(log_uuid)
print(log.result_data)

Usage statistics

Check how many requests and credits your account has used over a given period:

rows = spidra.usage("30d")  # "7d" | "30d" | "weekly"

for row in rows:
    print(row.date, row.requests, row.credits)

Retries and reliability

You don’t have to write retry loops. Transient failures (network blips and 502/503/504 gateway errors) are retried automatically with exponential backoff, so a single hiccup does not fail your call. The retry count and the backoff base are both configurable:

spidra = Spidra(
    api_key="spd_YOUR_API_KEY",
    max_retries=3,        # retry attempts for transient failures (default: 3, 0 disables)
    backoff_factor=1.0,   # base seconds — delay is backoff_factor * 2**(attempt-1)
)

The retry policy is designed so it can never double-charge you: 4xx client errors are never retried, and job submissions are only retried when the server explicitly rejected them — never on network errors or gateway timeouts, where the job may already have been queued. When the server sends a Retry-After hint, the SDK honors it instead of its own backoff.
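To see what the backoff formula from the constructor comment works out to, the sketch below evaluates backoff_factor * 2**(attempt-1) for each attempt. It only reproduces that formula. Whether the SDK adds jitter or caps the delay is not stated here, and a Retry-After hint takes precedence when the server sends one. backoff_delays is a hypothetical helper.

```python
def backoff_delays(max_retries=3, backoff_factor=1.0):
    """Return the delay in seconds before each retry attempt, counted from 1."""
    return [backoff_factor * 2 ** (attempt - 1) for attempt in range(1, max_retries + 1)]


print(backoff_delays())        # [1.0, 2.0, 4.0]
print(backoff_delays(4, 0.5))  # [0.5, 1.0, 2.0, 4.0]
```

With the defaults, a call that exhausts all three retries spends about seven seconds in backoff before the error reaches your code.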

Error handling

Every API error is mapped to a typed exception class, so you can catch exactly what you care about and let the rest bubble up:

from spidra import (
    SpidraError,
    SpidraAuthenticationError,
    SpidraForbiddenError,
    SpidraValidationError,
    SpidraRateLimitError,
    SpidraServerError,
    SpidraJobFailedError,
    SpidraTimeoutError,
)

try:
    result = spidra.scrape("https://example.com", prompt="Extract the main headline")
except SpidraAuthenticationError:
    # 401: Missing or invalid API key
    pass
except SpidraForbiddenError:
    # 403: Monthly credit limit reached
    pass
except SpidraValidationError as e:
    # 422: Bad request body — e.errors lists each problem
    print(e.errors)
except SpidraRateLimitError as e:
    # 429: Too many requests — metadata tells you exactly how long to wait
    print(f"Rate limited. {e.remaining}/{e.limit} left, retry in {e.retry_after}s")
except SpidraJobFailedError as e:
    # The job itself failed or was cancelled (not a transport error)
    print(f"Job {e.job_id} {e.job_status}: {e.message}")
except SpidraTimeoutError as e:
    # Your poll timeout elapsed — the job is still running server-side
    print(f"Still running after {e.timeout_seconds}s, check {e.job_id} later")
except SpidraServerError as e:
    # 5xx: Something went wrong on Spidra's side (already retried automatically)
    print(f"Server error ({e.status}): {e.message}")
except SpidraError as e:
    # Catch-all for anything else
    print(f"API error {e.status}: {e.message}")

Exception	Status	When
`SpidraAuthenticationError`	401	Missing or invalid API key
`SpidraPaymentRequiredError`	402	Subscription payment overdue
`SpidraForbiddenError`	403	Monthly credit limit reached
`SpidraNotFoundError`	404	Job, batch, or log does not exist
`SpidraValidationError`	422	Request body failed validation — `e.errors` lists each problem
`SpidraRateLimitError`	429	Too many requests — carries `e.limit`, `e.remaining`, `e.reset_at`, `e.retry_after`
`SpidraServerError`	5xx	Unexpected server-side error
`SpidraJobFailedError`	—	The job itself failed or was cancelled — carries `e.job_id`, `e.job_status`
`SpidraTimeoutError`	—	Your poll `timeout` elapsed; the job is still running — carries `e.job_id`
`SpidraError`	any	Base class for all Spidra exceptions

All exceptions expose .status (the HTTP status code, or 0 for non-HTTP errors like job failures and timeouts) and .message. API errors also carry .code (a machine-readable identifier like SERVICE_BUSY) and .details (the raw error body).

Verifying webhooks

Crawl jobs can push crawl.page, crawl.completed, and crawl.failed events to your webhook_url. Spidra signs each delivery with HMAC-SHA256 in the X-Spidra-Signature header, and the SDK ships a helper so you never accept a forged event:

import json
from spidra import verify_webhook

# FastAPI example — verify against the RAW body, not the parsed JSON
@app.post("/webhooks/spidra")
async def spidra_webhook(request: Request):
    raw = await request.body()
    if not verify_webhook(raw, request.headers.get("x-spidra-signature"), WEBHOOK_SECRET):
        raise HTTPException(status_code=401)

    event = json.loads(raw)
    if event["event"] == "crawl.page":
        print("New page:", event["page"]["url"])
    return {"ok": True}

The helper uses only the standard library and compares in constant time. Always pass the raw request body — re-serialising parsed JSON produces different bytes and fails verification.
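For intuition about what such a check involves, here is a standard-library sketch of HMAC-SHA256 verification. It is not the SDK's implementation. It assumes the signature is the hex digest of the raw body with no prefix or timestamp, which may not match Spidra's actual header format, so use verify_webhook in real code. sign and verify are hypothetical names, and the secret is a made-up value.

```python
import hashlib
import hmac
import json


def sign(raw_body: bytes, secret: str) -> str:
    """Compute a hex-encoded HMAC-SHA256 digest of the raw body (assumed format)."""
    return hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()


def verify(raw_body: bytes, signature, secret: str) -> bool:
    """Compare signatures in constant time; a missing header never verifies."""
    if not signature:
        return False
    return hmac.compare_digest(sign(raw_body, secret).encode(), signature.encode())


body = b'{"event":"crawl.page"}'
signature = sign(body, "whsec_example")
reserialised = json.dumps(json.loads(body)).encode()  # adds a space after the colon

print(verify(body, signature, "whsec_example"))          # True
print(verify(reserialised, signature, "whsec_example"))  # False
```

The second call fails because json.dumps emits different bytes from the ones that were signed, which is why verification has to run on the raw body.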
