POST /scrape
curl --request POST \
  --url https://api.spidra.io/api/scrape \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '{
    "urls": [{ "url": "https://example.com" }],
    "prompt": "Extract the main heading and first paragraph",
    "output": "json"
  }'
{
  "status": "queued",
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Scrape job has been queued. Poll /api/scrape/550e8400-e29b-41d4-a716-446655440000 to get the result."
}

How It Works

Spidra runs scrape jobs asynchronously. When you submit a request, you get a jobId back immediately. You then poll GET /scrape/{jobId} until status is completed and results are ready.
  1. Submit - Send your request, receive a jobId in the response right away
  2. Load - Spidra opens each URL in a real browser
  3. Execute - Runs your browser actions (clicks, scrolls, etc.)
  4. Solve - Automatically handles CAPTCHAs
  5. Process - Runs AI extraction if a prompt is provided
  6. Poll - Check GET /scrape/{jobId} until status: "completed"
Setting output: "json" without a prompt still triggers a default AI extraction pass. If you want raw markdown with no AI processing, omit both output and prompt.

When AI extraction fails (for example, on a near-empty page), Spidra falls back to returning the raw page markdown in markdownContent. Check the ai_extraction_failed flag in the response to detect this case and handle degraded results in your code.
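The submit-then-poll flow above can be sketched as a small client-side loop. This is an illustrative Python sketch, not an official client: the fetch function is injected so the loop itself stays testable, and the status values ("queued", "completed", "failed") follow the response fields shown in this page.

```python
import time

def wait_for_result(job_id, fetch_status, interval=2.0, timeout=120.0):
    """Poll the job status endpoint until the scrape completes.

    fetch_status(job_id) should perform GET /scrape/{jobId} and return the
    parsed JSON status response; it is injected here so the polling logic
    can be exercised without network access.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            raise RuntimeError(f"scrape job {job_id} failed: {status}")
        time.sleep(interval)  # back off between polls
    raise TimeoutError(f"scrape job {job_id} did not complete in {timeout}s")
```

In real use, fetch_status would wrap an HTTP GET carrying the same x-api-key header as the submit request.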

Structured Output

Pass a schema to tell the AI exactly what shape to return. Instead of getting whatever JSON the AI decides to produce, you get back a JSON object that matches your schema every time. Nullable fields come back as null rather than being omitted. Field names match exactly what you defined.
{
  "urls": [{ "url": "https://jobs.example.com/engineer" }],
  "prompt": "Extract the job details. Normalize salary to a plain number in USD.",
  "schema": {
    "type": "object",
    "required": ["title", "company", "remote", "employment_type"],
    "properties": {
      "title":           { "type": "string" },
      "company":         { "type": "string" },
      "remote":          { "type": ["boolean", "null"] },
      "salary_min":      { "type": ["number", "null"] },
      "salary_max":      { "type": ["number", "null"] },
      "employment_type": {
        "type": ["string", "null"],
        "enum": ["full_time", "part_time", "contract", null]
      }
    }
  }
}
output is automatically set to "json" when a schema is provided. The schema is validated before the job is queued and a 422 is returned with descriptive errors if the schema is malformed. Non-fatal issues (unsupported keywords) are returned as schema_warnings in the job status response.
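A request-builder helper can mirror these rules client-side so malformed requests fail before they are sent. This is a hedged sketch (build_scrape_request is a hypothetical helper, not part of any SDK); it reproduces two documented behaviors: the 1-3 URL limit and output being forced to "json" whenever a schema is supplied.

```python
def build_scrape_request(urls, prompt=None, schema=None, output=None, **options):
    """Assemble a /scrape request body from convenient arguments.

    urls may be plain strings or {"url": ...} dicts. When a schema is
    provided, output is set to "json" to match the API's automatic
    behavior, and the root type is checked locally.
    """
    if not 1 <= len(urls) <= 3:
        raise ValueError("urls must contain 1-3 entries")
    body = {"urls": [{"url": u} if isinstance(u, str) else u for u in urls]}
    if prompt:
        body["prompt"] = prompt
    if schema is not None:
        if schema.get("type") != "object":
            raise ValueError("schema root must be type 'object'")
        body["schema"] = schema
        body["output"] = "json"  # the API forces this anyway; stay explicit
    elif output:
        body["output"] = output
    body.update(options)  # e.g. useProxy, cookies, screenshot
    return body
```

The local checks only catch the obvious cases; the API's own validation (and its 422 errors with schema_warnings) remains authoritative.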

Structured Output Guide

Full guide: nested objects, arrays, nullable fields, the required rule, and schema limits

Zod and Pydantic

Generate your schema from an existing Zod or Pydantic model

Browser Actions

Interact with the page before scraping — click buttons, fill forms, scroll to load content, dismiss modals, and iterate over lists of elements.
{
  "urls": [{
    "url": "https://example.com/products",
    "actions": [
      {"type": "click", "selector": "#accept-cookies"},
      {"type": "wait", "duration": 1500},
      {"type": "scroll", "to": "80%"}
    ]
  }],
  "prompt": "List all product names and prices",
  "output": "json"
}

Available Actions

Action | What it does | Quick example
click | Clicks any element on the page: buttons, links, tabs, toggles | {"type": "click", "selector": "#load-more"}
type | Types text into an input field or search box | {"type": "type", "selector": "#search", "value": "laptops"}
check | Checks a checkbox | {"type": "check", "selector": "#in-stock-only"}
uncheck | Unchecks a checkbox | {"type": "uncheck", "selector": "#newsletter"}
wait | Pauses for a number of milliseconds | {"type": "wait", "duration": 2000}
scroll | Scrolls the page to a percentage of its height | {"type": "scroll", "to": "80%"}
forEach | Finds all matching elements and processes each one individually; supports navigate, click, and inline modes | {"type": "forEach", "observe": "Find all product cards", "mode": "navigate"}
For click, check, and uncheck you can target elements by CSS selector like "selector": "#id" or by a plain English description like "value": "Accept cookies button". Both work.
forEach is the most powerful action. It finds a set of repeating elements (product cards, links, accordion rows) and runs a mini-scrape on each one. It supports three modes (click, inline, navigate), automatic pagination, per-item AI extraction, and per-element sub-actions.
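Putting that together, a navigate-mode forEach request might look like the following sketch. It uses only the forEach fields shown in the table above (type, observe, mode) and a top-level prompt for per-page extraction; the per-item options and pagination settings are covered in the full guide below.

```json
{
  "urls": [{
    "url": "https://example.com/products",
    "actions": [
      {"type": "click", "selector": "#accept-cookies"},
      {"type": "forEach", "observe": "Find all product cards", "mode": "navigate"}
    ]
  }],
  "prompt": "On each product page, extract the name, price, and availability",
  "output": "json"
}
```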

Full Browser Actions Guide

Detailed explanations for every action with real examples, all forEach options, chaining patterns, and sample responses

Proxy and Geo-Targeting

Route requests through residential proxies to avoid detection or access geo-restricted content. Set "useProxy": true and optionally add "proxyCountry" to target a specific location.
{
  "urls": [{"url": "https://amazon.de/dp/B123456"}],
  "prompt": "Extract the product price in euros",
  "output": "json",
  "useProxy": true,
  "proxyCountry": "de"
}

Stealth Mode & Geo-Targeting Guide

Full guide: country list, EU rotation, examples, and credit costs

Extract Content Only

Remove navigation, headers, footers, and sidebars before processing. Useful when you only want the main article or product content.
{
  "urls": [{"url": "https://blog.example.com/article"}],
  "prompt": "Summarize this article",
  "output": "json",
  "extractContentOnly": true
}

Screenshots

Capture screenshots of scraped pages for debugging or archival.
{
  "urls": [{"url": "https://example.com"}],
  "prompt": "Extract page title",
  "screenshot": true,
  "fullPageScreenshot": true
}
Option | Description
screenshot: true | Capture the visible viewport
fullPageScreenshot: true | Capture the entire scrollable page (requires screenshot: true)
Screenshot URLs are returned in the screenshots array of the response.

Authentication

Scrape protected pages by providing session cookies:
{
  "urls": [{"url": "https://example.com/dashboard"}],
  "prompt": "Extract account details",
  "output": "json",
  "cookies": "session=eyJ...; auth_token=abc123..."
}
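If you hold cookies as name/value pairs (for example, exported from a browser extension), a small helper can serialize them into the standard format the cookies field accepts. This is a convenience sketch, not part of the API:

```python
def cookie_string(cookies: dict) -> str:
    """Serialize name/value pairs into the 'name=value; name2=value2'
    format expected by the cookies request field."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```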

Authenticated Scraping

Full guide on getting cookies and formats

Check Job Status

Poll for results

View Logs

See your scrape history

Authorizations

x-api-key
string
header
required

Body

application/json
urls
object[]
required

Array of URLs to scrape (1-3 URLs per request)

Required array length: 1 - 3 elements
prompt
string

Optional LLM prompt for extracting or transforming the scraped content

output
enum<string>
default:json

Output format for the extracted content

Available options:
json,
markdown
useProxy
boolean
default:false

Enable stealth mode with proxy rotation to avoid detection

proxyCountry
string

Country code (e.g., 'us', 'uk', 'de') or region ('global', 'asia', 'eu') for geo-targeted proxy routing. Requires useProxy: true

cookies
string

Session cookies for authenticated scraping. Supports standard format (name=value; name2=value2) or raw Chrome DevTools paste format

screenshot
boolean
default:false

Capture a screenshot of each page after scraping

fullPageScreenshot
boolean
default:false

Capture full page screenshot instead of just the viewport. Requires screenshot: true

extractContentOnly
boolean
default:false

Remove headers, footers, navigation, and other non-content elements from the scraped output

schema
object

JSON Schema object describing the exact shape of the AI output. When provided, the AI must return JSON matching this schema. Output is automatically set to 'json'. Root must be type 'object'. Maximum nesting depth: 5. Maximum size: 10KB.
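These limits can be pre-flighted locally so a malformed schema fails fast instead of with a 422 from the API. This is a hedged sketch: the 10KB limit is taken as 10,000 bytes of serialized JSON, and depth is counted as levels of nested properties/items, both of which are assumptions about the server-side accounting.

```python
import json

def check_schema_limits(schema: dict, max_depth: int = 5, max_bytes: int = 10_000) -> None:
    """Locally enforce the documented schema constraints: root must be an
    object, serialized size under max_bytes, nesting depth under max_depth."""
    if schema.get("type") != "object":
        raise ValueError("schema root must be type 'object'")
    size = len(json.dumps(schema).encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"schema is {size} bytes; maximum is {max_bytes}")

    def schema_depth(node) -> int:
        # Count one level per schema node, descending through
        # 'properties' (objects) and 'items' (arrays).
        if not isinstance(node, dict):
            return 0
        deepest = 0
        props = node.get("properties")
        if isinstance(props, dict):
            deepest = max([deepest] + [schema_depth(v) for v in props.values()])
        items = node.get("items")
        if isinstance(items, dict):
            deepest = max(deepest, schema_depth(items))
        return 1 + deepest

    depth = schema_depth(schema)
    if depth > max_depth:
        raise ValueError(f"schema nesting depth {depth} exceeds {max_depth}")
```

The API's own validation remains the source of truth; this only catches the documented limits before a job is queued.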

Response

Job successfully queued

status
enum<string>
Available options:
queued
jobId
string

Unique job identifier for polling

message
string
deduplicated
boolean

True when an identical request was submitted within the last 5 seconds; in that case the response returns the existing job's jobId instead of queuing a new job