# Python

> Official Python SDK for the Spidra web scraping API. Async-first with sync wrappers. Scrape, batch-process URLs, crawl entire sites, and run browser actions from Python.

The Python SDK wraps the Spidra API so you're not writing raw HTTP calls and polling loops yourself. It handles job submission, status polling, retry logic, and error mapping to typed exceptions. The SDK is async by design, with synchronous wrappers on every method so it works anywhere — async scripts, Django views, Flask routes, or Jupyter notebooks.

## Installation

```bash theme={null}
pip install spidra
```

Requires Python 3.9 or higher.

<Note>
  Get your API key from [app.spidra.io](https://app.spidra.io) under **Settings → API Keys**.
  Store it as an environment variable. Never hardcode it.
</Note>
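One way to follow that advice is to read the key from the environment at startup and fail early if it is missing. This is a minimal sketch: the variable name `SPIDRA_API_KEY` and the helper `load_api_key` are illustrative choices, not names the SDK defines.

```python
import os


def load_api_key(env_var: str = "SPIDRA_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is unset.

    The variable name is an assumption for this example; use whatever name
    your deployment already has.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned string can then be passed as `api_key` when you construct `SpidraClient`.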

## Getting started

```python theme={null}
from spidra import SpidraClient

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")
```

From here you access everything through `spidra.scrape`, `spidra.batch`, `spidra.crawl`, `spidra.logs`, and `spidra.usage`.

Every method is `async` by default. If you're not in an async context, each method has a `_sync` counterpart that works anywhere:

```python theme={null}
# Async (inside an async function or Jupyter cell)
job = await spidra.scrape.run(params)

# Synchronous (anywhere — scripts, Django, Flask, etc.)
job = spidra.scrape.run_sync(params)
```

The sync wrappers handle event loop detection automatically, including Jupyter notebooks where calling `asyncio.run()` directly would fail.
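To make the event loop problem concrete, here is a rough sketch of one way a sync wrapper can cope with an already-running loop. It is not the SDK's actual implementation, and `run_coro_sync` is a hypothetical helper: it uses `asyncio.run()` when no loop is running and otherwise runs the coroutine on a fresh loop in a worker thread.

```python
import asyncio
import threading
from typing import Any, Coroutine


def run_coro_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread (plain script, Django, Flask): the simple path works.
        return asyncio.run(coro)

    # A loop is already running (for example in Jupyter), so asyncio.run()
    # would raise here. Run the coroutine on its own loop in a worker thread.
    outcome = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except Exception as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Note that the calling thread blocks while the worker runs, which is the expected behaviour for a synchronous call.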

## Scraping

The scraper accepts up to three URLs per request and processes them in parallel. You can pass a plain extraction prompt, a JSON schema, per-URL browser actions, or any combination of those.

The simplest path is `run()`, which submits the job, polls until it finishes, and returns the result:

```python theme={null}
from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")

job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com/pricing")],
    prompt="Extract all pricing plans with name, price, and included features",
    output="json",
))

print(job.result.content)
# {"plans": [{"name": "Starter", "price": "$9/mo", ...}]}
```

If you'd rather fire and move on, `submit()` returns a job ID immediately. You can then call `get()` whenever you're ready to check:

```python theme={null}
queued = await spidra.scrape.submit(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))

# Later...
status = await spidra.scrape.get(queued.job_id)

if status.status == "completed":
    print(status.result.content)
```

Job statuses move through: `queued` → `waiting` → `active` → `completed` (or `failed`).
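If you go the `submit()` and `get()` route, you end up writing a small polling loop. The sketch below shows one possible shape. `wait_for_job` is a hypothetical helper, not part of the SDK; it takes the status-fetching coroutine as an argument and assumes the returned object has a `status` attribute with the values listed above.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"completed", "failed"}


async def wait_for_job(
    fetch_status: Callable[[str], Awaitable[Any]],
    job_id: str,
    poll_interval: float = 3.0,
    timeout: float = 120.0,
) -> Any:
    """Poll fetch_status(job_id) until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = await fetch_status(job_id)
        if status.status in TERMINAL_STATUSES:
            return status
        # Stop if the next poll would land after the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(
                f"Job {job_id} still {status.status!r} after {timeout}s"
            )
        await asyncio.sleep(poll_interval)
```

With the client from the examples above, this would be called as `status = await wait_for_job(spidra.scrape.get, queued.job_id)`. The caller still has to check whether the final status is `completed` or `failed`.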

### Scrape parameters

| Parameter              | Type | Description                                                            |
| ---------------------- | ---- | ---------------------------------------------------------------------- |
| `urls`                 | list | Up to 3 `ScrapeUrl` objects. Each takes a `url` and optional `actions` |
| `prompt`               | str  | What to extract, written in plain English                              |
| `output`               | str  | `"markdown"` (default) or `"json"`                                     |
| `schema`               | dict | JSON Schema that forces a specific output shape                        |
| `use_proxy`            | bool | Route through a residential proxy                                      |
| `proxy_country`        | str  | Two-letter country code (`"us"`, `"de"`, `"jp"`, etc.), or `"global"` / `"eu"` for regional routing |
| `extract_content_only` | bool | Strip nav, ads, and boilerplate before the AI sees the page            |
| `screenshot`           | bool | Capture a viewport screenshot                                          |
| `full_page_screenshot` | bool | Capture a full-page scrolled screenshot                                |
| `cookies`              | str  | Raw `Cookie` header string for pages behind a login                    |

### Enforcing an exact output shape

Without a schema the AI extracts what it finds. With a schema, missing fields come back as `None` rather than guessed values, which matters when the output feeds a database or a typed pipeline downstream:

```python theme={null}
job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title":      {"type": "string"},
            "company":    {"type": "string"},
            "remote":     {"type": ["boolean", "null"]},
            "salary_min": {"type": ["number", "null"]},
            "skills":     {"type": "array", "items": {"type": "string"}},
        },
    },
))
```
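On the receiving side, a typed pipeline usually wants those fields in a concrete type. Here is a minimal sketch that maps the extracted result onto a dataclass. It assumes `job.result.content` is a plain dict keyed like the schema above; `JobListing` and `to_job_listing` are names made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobListing:
    title: str
    company: str
    remote: Optional[bool] = None
    salary_min: Optional[float] = None
    skills: List[str] = field(default_factory=list)


def to_job_listing(content: dict) -> JobListing:
    """Build a JobListing, rejecting rows whose title or company is empty."""
    missing = [key for key in ("title", "company") if not content.get(key)]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")
    return JobListing(
        title=content["title"],
        company=content["company"],
        remote=content.get("remote"),          # None when the page did not say
        salary_min=content.get("salary_min"),  # None when no salary was listed
        skills=content.get("skills") or [],
    )
```

Rejecting rows at this boundary keeps `None` titles out of the database instead of discovering them later.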

### Scraping geo-restricted content

Some sites serve different prices or content depending on where you're browsing from. Set `use_proxy=True` and a `proxy_country` code to route through a residential IP in that country:

```python theme={null}
job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://www.amazon.de/gp/bestsellers")],
    prompt="List the top 10 products with name and price",
    use_proxy=True,
    proxy_country="de",
))
```

Supported country codes include `us`, `gb`, `de`, `fr`, `jp`, `au`, `ca`, `br`, `in`, `nl`, and [40+ more](/features/stealth-mode#country-targeting). Use `"global"` or `"eu"` for regional routing without pinning to a specific country.

### Scraping pages behind a login

If the page requires a session, pass your cookies as a raw header string. The easiest way to get this is to log in through your browser, open devtools, and copy the `Cookie` header from any authenticated request:

```python theme={null}
job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://app.example.com/dashboard")],
    prompt="Extract the monthly revenue and active user count",
    cookies="session=abc123; auth_token=xyz789",
))
```
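If your session cookies are already sitting in a dictionary rather than a copied header, a small helper can produce the string form. `build_cookie_header` is a hypothetical helper for this example; it does no escaping, so it assumes names and values contain no `;` or `=` characters that need special handling.

```python
from typing import Mapping


def build_cookie_header(cookies: Mapping[str, str]) -> str:
    """Join name/value pairs into a single Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

For example, `build_cookie_header({"session": "abc123", "auth_token": "xyz789"})` produces the same string used in the snippet above.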

### Browser actions

Sometimes you need to interact with the page before extraction — dismiss a cookie banner, type into a search box, scroll to load lazy content. Pass an `actions` list inside the `ScrapeUrl` and they run in order before the AI sees the page:

```python theme={null}
from spidra import BrowserAction

job = await spidra.scrape.run(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://example.com/products",
            actions=[
                BrowserAction(type="click", selector="#accept-cookies"),
                BrowserAction(type="wait", duration=1000),
                BrowserAction(type="scroll", to="80%"),
            ],
        ),
    ],
    prompt="Extract all product names and prices visible on the page",
))
```

For `selector` you can pass a CSS selector or XPath. If you'd rather describe the element in plain English, use `value` and Spidra will locate it with AI.

| Action    | What it does                                                       |
| --------- | ------------------------------------------------------------------ |
| `click`   | Click any element — use `selector` for CSS, `value` for plain text |
| `type`    | Type into an input or textarea                                     |
| `check`   | Check a checkbox                                                   |
| `uncheck` | Uncheck a checkbox                                                 |
| `wait`    | Pause for `duration` milliseconds                                  |
| `scroll`  | Scroll to a percentage of the page height, e.g. `"80%"`            |
| `forEach` | Loop over every matched element and extract from each one          |

### Controlling how long run() waits

By default `run()` polls every 3 seconds and gives up after 120 seconds. You can override both by passing a `PollOptions` object:

```python theme={null}
from spidra import PollOptions

job = await spidra.scrape.run(
    ScrapeParams(urls=[ScrapeUrl(url="https://example.com")], prompt="..."),
    PollOptions(poll_interval=5, timeout=60),
)
```

The same options work on `batch.run()` and `crawl.run()`.

## Batch scraping

When you have a list of URLs to process, batch is the right tool. You can submit up to 50 URLs in a single request and they all run in parallel. Unlike the scraper, each URL here is a plain string — there's no per-URL actions support in batch mode.

```python theme={null}
from spidra import BatchScrapeParams

batch = await spidra.batch.run(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/1",
        "https://shop.example.com/product/2",
        "https://shop.example.com/product/3",
    ],
    prompt="Extract product name, price, and whether it is in stock",
    output="json",
))

print(f"{batch.completed_count}/{batch.total_urls} completed")

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)
    else:
        print(f"Failed: {item.url} — {item.error}")
```

Each item moves through `pending` → `running` → `completed` (or `failed`). The batch itself follows the same lifecycle, plus a `cancelled` state if you stop it early.
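If your list is longer than the 50-URL limit, you need to split it before submitting. A minimal sketch, with `chunk_urls` as a hypothetical helper name:

```python
from typing import Iterator, List, Sequence


def chunk_urls(urls: Sequence[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])
```

Each chunk can then be passed as `urls` to its own `BatchScrapeParams`.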

If you don't want to wait for the whole thing to finish, use `submit()` and `get()` separately:

```python theme={null}
queued = await spidra.batch.submit(BatchScrapeParams(
    urls=["https://example.com/1", "https://example.com/2"],
    prompt="Extract the page title and meta description",
))

# Come back later
result = await spidra.batch.get(queued.batch_id)
print(f"{result.completed_count} of {result.total_urls} done")
```

### Retrying failures and cancelling

If some items fail due to timeouts or transient errors, you can retry just those without re-running the ones that already succeeded:

```python theme={null}
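# Re-fetch the batch so the failure count is current, then retry by its ID
result = await spidra.batch.get(queued.batch_id)
if result.failed_count > 0:
    await spidra.batch.retry(queued.batch_id)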
```
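Before retrying, it can help to see which URLs failed and why, since a retry won't fix a permanently broken URL. The sketch below summarizes a list of batch items using only the `status`, `url`, and `error` attributes shown earlier; `summarize_items` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def summarize_items(items: Iterable[Any]) -> Tuple[Dict[str, int], List[Tuple[str, Any]]]:
    """Return per-status counts and a list of (url, error) for failed items."""
    items = list(items)
    counts = dict(Counter(item.status for item in items))
    failures = [(item.url, item.error) for item in items if item.status == "failed"]
    return counts, failures
```

Called as `counts, failures = summarize_items(batch.items)`, it gives you a quick overview before you decide whether a retry is worth the credits.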

To stop a running batch and get credits back for anything that hasn't started yet:

```python theme={null}
response = await spidra.batch.cancel(batch_id)
print(f"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} credits")
```

To look through past batches:

```python theme={null}
from spidra import BatchListParams

page = await spidra.batch.list(BatchListParams(page=1, limit=20))

for job in page.jobs:
    print(job.uuid, job.status, f"{job.completed_count}/{job.total_urls}")
```
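To walk the whole history rather than a single page, you can loop over page numbers. The sketch below is written against a generic `fetch_page` coroutine so it stays independent of the SDK. It assumes that requesting a page past the end returns an empty list, which you should confirm against the API before relying on it; `iter_all_jobs` is a hypothetical helper.

```python
from typing import Any, AsyncIterator, Awaitable, Callable, List


async def iter_all_jobs(
    fetch_page: Callable[[int], Awaitable[List[Any]]],
    max_pages: int = 100,
) -> AsyncIterator[Any]:
    """Yield jobs page by page until an empty page or max_pages is reached."""
    for page_number in range(1, max_pages + 1):
        jobs = await fetch_page(page_number)
        if not jobs:
            return  # assumed end-of-list signal
        for job in jobs:
            yield job


# Possible wiring with the client from the examples above:
#
#     async def fetch_page(n):
#         page = await spidra.batch.list(BatchListParams(page=n, limit=20))
#         return page.jobs
#
#     async for job in iter_all_jobs(fetch_page):
#         print(job.uuid, job.status)
```

The `max_pages` cap is there so a surprise in the pagination behaviour can't turn into an endless loop.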

## Crawling

Crawling is different from scraping. You give it a starting URL and it discovers and processes pages on its own, following links according to your instructions. Good for indexing a docs site, monitoring a competitor's blog, or building a structured dataset from an entire section of a site.

```python theme={null}
from spidra import CrawlParams, PollOptions

job = await spidra.crawl.run(
    CrawlParams(
        base_url="https://competitor.com/blog",
        crawl_instruction="Follow links to blog posts only, skip tag pages, category pages, and the homepage",
        transform_instruction="Extract the post title, author name, publish date, and a one-sentence summary",
        max_pages=30,
        use_proxy=True,
    ),
    PollOptions(timeout=360),
)

for page in job.result:
    print(page.url, page.data)
```

`crawl_instruction` tells the crawler which links to follow. `transform_instruction` tells the AI what to extract from each page it visits. `max_pages` is a safety cap so the crawl doesn't run indefinitely. The default timeout for `crawl.run()` is 120 seconds — pass a higher value for bigger crawls.

The same `use_proxy`, `proxy_country`, and `cookies` options from the scraper work here too.

Just like scraping, you can fire-and-forget with `submit()` and poll with `get()`:

```python theme={null}
queued = await spidra.crawl.submit(CrawlParams(
    base_url="https://example.com/docs",
    crawl_instruction="Follow all documentation pages",
    transform_instruction="Extract the page title and a short summary of the content",
    max_pages=50,
))

status = await spidra.crawl.get(queued.job_id)
# status moves through: waiting → active → running → completed (or failed)
```

### Downloading the raw content

Once a crawl completes, you can fetch signed URLs to download the raw HTML and Markdown for every page that was crawled. These links expire after an hour:

```python theme={null}
response = await spidra.crawl.pages(job_id)

for page in response.pages:
    # page.html_url     — download the raw HTML
    # page.markdown_url — download the cleaned Markdown
    print(page.url, page.status)
```
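Because the links expire, you will probably want to save the files right away. A minimal sketch using only the standard library; `download_page` is a hypothetical helper, and it assumes the signed URL can be fetched with a plain GET and no extra headers.

```python
import urllib.request
from pathlib import Path


def download_page(signed_url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Fetch a signed URL and write the response body to dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(signed_url, timeout=timeout) as response:
        dest.write_bytes(response.read())
    return dest
```

Inside the loop above you could call it with `page.markdown_url` or `page.html_url` and a destination path of your choosing.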

### Re-extracting with a different prompt

If you crawled a site and want to pull out different information, you don't have to re-crawl. `extract()` runs a new AI pass over the already-crawled content and charges only transformation credits:

```python theme={null}
queued = await spidra.crawl.extract(
    completed_job_id,
    "Extract only product SKUs and prices as structured JSON",
)

# This creates a new job — poll it like any other
result = await spidra.crawl.get(queued.job_id)
```

### Browsing your crawl history

```python theme={null}
from spidra import CrawlHistoryParams

response = await spidra.crawl.history(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")

stats = await spidra.crawl.stats()
print(f"All-time: {stats.total}")
```

## Logs

Every scrape request your API key makes gets logged automatically. You can filter by status, URL, date range, or where it came from:

```python theme={null}
from spidra import ScrapeLogsParams

response = await spidra.logs.list(ScrapeLogsParams(
    status="failed",
    search_term="amazon.com",
    date_start="2024-01-01",
    date_end="2024-12-31",
    page=1,
    limit=20,
))

for log in response.logs:
    print(log.urls[0].get("url"), log.status, log.credits_used)
```
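Hardcoded dates go stale quickly, so for a rolling window you may prefer to compute them. The sketch below builds `YYYY-MM-DD` strings in the same format as the example above. `last_n_days` is a hypothetical helper, and whether the API treats `date_end` as inclusive is an assumption to check.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def last_n_days(days: int, today: Optional[date] = None) -> Tuple[str, str]:
    """Return (date_start, date_end) strings covering `days` days ending today."""
    if days < 1:
        raise ValueError("days must be at least 1")
    end = today or date.today()
    start = end - timedelta(days=days - 1)  # window includes the end date
    return start.isoformat(), end.isoformat()
```

The two strings can be passed as `date_start` and `date_end` in `ScrapeLogsParams`.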

To fetch the full details of a single log entry, including the AI extraction output:

```python theme={null}
log = await spidra.logs.get(log_uuid)
print(log.result_data)
```

## Usage statistics

Check how many requests and credits your account has used over a given period:

```python theme={null}
rows = await spidra.usage.get("30d")  # "7d" | "30d" | "weekly"

for row in rows:
    print(row.date, row.requests, row.credits)
```

`"7d"` gives one row per day for the last week. `"30d"` gives one row per day for the last 30 days. `"weekly"` gives one row per week for the last seven weeks.
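If you only need totals for the period, you can sum the rows yourself. A small sketch, assuming each row exposes numeric `requests` and `credits` attributes as in the loop above; `total_usage` is a hypothetical helper.

```python
from typing import Any, Iterable, Tuple


def total_usage(rows: Iterable[Any]) -> Tuple[int, int]:
    """Return (total_requests, total_credits) across all rows."""
    total_requests = 0
    total_credits = 0
    for row in rows:
        total_requests += row.requests
        total_credits += row.credits
    return total_requests, total_credits
```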

## Error handling

Every API error is mapped to a typed exception class, so you can catch exactly what you care about and let the rest bubble up:

```python theme={null}
from spidra import (
    SpidraError,
    SpidraAuthenticationError,
    SpidraInsufficientCreditsError,
    SpidraRateLimitError,
    SpidraServerError,
)

try:
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract the main headline",
    ))
except SpidraAuthenticationError:
    # 401: No x-api-key header sent
    pass
except SpidraInsufficientCreditsError:
    # 403: Invalid API key, or monthly credit limit reached — check e.message to distinguish
    pass
except SpidraRateLimitError:
    # Slow down — you're hitting limits
    pass
except SpidraServerError as e:
    # Something went wrong on Spidra's side — retry is usually safe
    print(f"Server error ({e.status}): {e.message}")
except SpidraError as e:
    # Catch-all for anything else
    print(f"API error {e.status}: {e.message}")
```

| Exception                        | Status | When                                                                        |
| -------------------------------- | ------ | --------------------------------------------------------------------------- |
| `SpidraAuthenticationError`      | 401    | No `x-api-key` header sent                                                  |
| `SpidraInsufficientCreditsError` | 403    | Invalid API key, or no credits remaining — check `e.message` to distinguish |
| `SpidraRateLimitError`           | 429    | Too many requests — back off                                                |
| `SpidraServerError`              | 500    | Unexpected server-side error                                                |
| `SpidraError`                    | any    | Base class for all Spidra exceptions                                        |

All exceptions expose `.status` for the HTTP status code and `.message` for a human-readable explanation.
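Rate-limit and server errors are usually worth retrying after a pause. The sketch below shows one common approach, exponential backoff with jitter. It is not something the SDK is documented to provide: `with_retries` and its parameters are made up for this example, and the exception classes to retry on are passed in by the caller.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Tuple, Type


async def with_retries(
    call: Callable[[], Awaitable[Any]],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> Any:
    """Await call(), retrying on the given exceptions with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay,
            # then is scaled randomly so parallel clients don't retry in step.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

With the exceptions above it could be used as `await with_retries(lambda: spidra.scrape.run(params), retry_on=(SpidraRateLimitError, SpidraServerError))`. Any exception not listed in `retry_on` propagates immediately.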

<CardGroup cols={2}>
  <Card title="Ruby" icon="gem" href="/sdks/ruby">
    Official Ruby SDK — pure stdlib, no external dependencies. Works in Rails, Sinatra, and scripts.
  </Card>

  <Card title="Elixir" icon="erlang" href="/sdks/elixir">
    Official Elixir SDK — idiomatic pattern matching, OTP-ready, works with Phoenix and plain Mix projects.
  </Card>
</CardGroup>