Installation
Get your API key from app.spidra.io under Settings → API Keys.
Store it as an environment variable. Never hardcode it.
Getting started
The client exposes five resources: $spidra->scrape, $spidra->batch, $spidra->crawl, $spidra->logs, and $spidra->usage.
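A minimal setup might look like the sketch below. The Spidra\Client class name and constructor signature are assumptions based on common SDK conventions — only the five resource names come from this document:

```php
<?php
require 'vendor/autoload.php';

// Hypothetical client class — check the SDK for the real namespace.
$spidra = new Spidra\Client(getenv('SPIDRA_API_KEY'));

// The five resources referenced throughout these docs:
$spidra->scrape; // single scrape jobs (up to 3 URLs)
$spidra->batch;  // bulk scraping (up to 50 URLs)
$spidra->crawl;  // link-following crawls
$spidra->logs;   // request history
$spidra->usage;  // credit and request statistics
```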
Scraping
The scraper accepts up to three URLs per request and processes them in parallel. You can pass a plain extraction prompt, a full JSON schema, per-URL browser actions, or any mix of those. The simplest path is run() — it submits the job and blocks until it finishes, then returns the result:
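A sketch of the blocking path, assuming run() accepts the parameter array described in the table below and returns the extraction result (the exact return shape is not specified here):

```php
// Blocks until the job finishes, then returns the result.
$result = $spidra->scrape->run([
    'urls'   => [['url' => 'https://example.com/pricing']],
    'prompt' => 'Extract the name and monthly price of each plan',
]);
```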
submit() returns a jobId immediately. You can then call get() whenever you’re ready to check:
Jobs move through waiting → active → completed (or failed).
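A non-blocking flow might look like this — the method names and status values come from the text above; the jobId and status field names are assumptions:

```php
$job = $spidra->scrape->submit([
    'urls'   => [['url' => 'https://example.com']],
    'prompt' => 'Extract the page title',
]);

// Poll at your own pace: waiting → active → completed (or failed).
do {
    sleep(3);
    $job = $spidra->scrape->get($job['jobId']);
} while (in_array($job['status'], ['waiting', 'active'], true));
```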
Scrape parameters
| Parameter | Type | Description |
|---|---|---|
| urls | array | Up to 3 URLs. Each entry is ['url' => '...', 'actions' => [...]] |
| prompt | string | What to extract, written in plain English |
| output | string | "markdown" (default) or "json" |
| schema | array | JSON Schema — forces a specific shape when using output: "json" |
| useProxy | bool | Route through a residential proxy |
| proxyCountry | string | Two-letter country code: "us", "de", "jp", etc. |
| extractContentOnly | bool | Strip nav, ads, and boilerplate before the AI sees the page |
| screenshot | bool | Capture a viewport screenshot |
| fullPageScreenshot | bool | Capture a full-page (scrolled) screenshot |
| cookies | string | Raw Cookie header string for pages behind a login |
Enforcing an exact output shape
Without a schema, the AI extracts what it finds. With a schema, missing fields come back as null rather than guessed values — useful when the output feeds a database or a typed pipeline downstream.
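For example, a sketch pinning the output to a fixed shape — the schema keys follow standard JSON Schema; the surrounding call shape is an assumption:

```php
$result = $spidra->scrape->run([
    'urls'   => [['url' => 'https://example.com/products/1']],
    'prompt' => 'Extract the product details',
    'output' => 'json',
    'schema' => [
        'type' => 'object',
        'properties' => [
            'name'    => ['type' => 'string'],
            'price'   => ['type' => 'number'],
            'inStock' => ['type' => 'boolean'],
        ],
        'required' => ['name', 'price', 'inStock'],
    ],
]);
// Fields the page doesn't contain come back as null, never guessed.
```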
Scraping geo-restricted content
Some sites serve different prices or content depending on where you’re browsing from. Set useProxy and proxyCountry to route through a residential IP in that country:
Supported codes include us, gb, de, fr, jp, au, ca, br, in, nl, and 40+ more. Use "global" or "eu" for regional routing without pinning to a specific country.
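A sketch routing through a German residential IP, using the parameter names from the table above (the call shape is illustrative):

```php
$result = $spidra->scrape->run([
    'urls'         => [['url' => 'https://example.de/angebote']],
    'prompt'       => 'Extract the listed prices and currency',
    'useProxy'     => true,
    'proxyCountry' => 'de', // two-letter code, or "global" / "eu"
]);
```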
Scraping pages behind a login
If the page requires a session, pass your cookies as a raw header string. The easiest way to get this is to log in through your browser’s devtools, then copy the Cookie header from any authenticated request.
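As a sketch, with a placeholder cookie string (the cookies parameter name comes from the table above):

```php
$result = $spidra->scrape->run([
    'urls'    => [['url' => 'https://example.com/account/orders']],
    'prompt'  => 'Extract my five most recent orders',
    // Paste the raw Cookie header copied from an authenticated devtools request.
    'cookies' => 'session=abc123; csrftoken=xyz789',
]);
```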
Browser actions
Sometimes you need to interact with the page before extraction — dismiss a cookie banner, type into a search box, scroll to load lazy content. Pass an actions array inside the URL entry and they’ll run in order before the AI sees the page:
For selector you can pass a CSS selector or an XPath expression. If you’d rather describe the element in plain English, use value instead — Spidra will locate it with AI.
| Action | What it does |
|---|---|
| click | Click any element — use selector for CSS, value for plain text |
| type | Type into an input or textarea |
| check | Check a checkbox |
| uncheck | Uncheck a checkbox |
| wait | Pause for duration milliseconds |
| scroll | Scroll to a percentage of the page height (e.g. "80%") |
| forEach | Loop over every matched element and extract from each one |
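Putting the table together, a sketch that dismisses a banner, waits, and scrolls before extraction — the action names and the selector/value/duration fields come from the table, but the exact payload shape (e.g. the type key) is an assumption:

```php
$result = $spidra->scrape->run([
    'urls' => [[
        'url'     => 'https://example.com/search?q=standing+desk',
        'actions' => [
            // value holds a plain-English description, so the AI locates the element.
            ['type' => 'click', 'value' => 'the Accept cookies button'],
            ['type' => 'wait', 'duration' => 1500],
            // Scroll most of the way down to trigger lazy-loaded content.
            ['type' => 'scroll', 'value' => '80%'],
        ],
    ]],
    'prompt' => 'Extract the first ten search results with their prices',
]);
```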
Controlling how long run() waits
By default run() polls every 3 seconds and gives up after 120 seconds. You can override both:
The same overrides apply to batch->run() and crawl->run().
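The override option names are not specified in this document — pollInterval and timeout below are placeholders to show the idea:

```php
// Hypothetical option names — check the SDK for the real ones.
$result = $spidra->scrape->run($params, [
    'pollInterval' => 5,   // seconds between status checks
    'timeout'      => 300, // give up after 5 minutes
]);
```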
Batch scraping
When you have a list of URLs to process, batch is the right tool. You can submit up to 50 URLs in a single request and they all run in parallel. Unlike the scraper, each URL here is a plain string — there’s no per-URL actions support. Each entry in items moves through pending → running → completed (or failed). The batch itself follows the same lifecycle, plus a cancelled state if you stop it early.
If you don’t want to wait for the whole batch to finish, use submit() and get() separately:
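A sketch of that split flow — plain-string URLs and the items lifecycle come from the text above; the jobId and per-item field names are assumptions:

```php
$batch = $spidra->batch->submit([
    'urls'   => ['https://example.com/a', 'https://example.com/b'], // plain strings, up to 50
    'prompt' => 'Extract the article headline and author',
]);

// Later: check how each item is progressing.
$status = $spidra->batch->get($batch['jobId']);
foreach ($status['items'] as $item) {
    echo $item['url'] . ': ' . $item['status'] . PHP_EOL;
}
```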
Retrying failures and cancelling
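As a sketch — the retry() and cancel() method names are assumptions, not confirmed by this document:

```php
// Hypothetical method names: re-run only the failed items of a batch,
// or stop a running batch early (moving it to the cancelled state).
$spidra->batch->retry($jobId);
$spidra->batch->cancel($jobId);
```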
If some items fail (transient network errors, timeouts), you can retry just those without re-running the ones that already succeeded.

Crawling
Crawling is different from scraping — you give it a starting URL and it discovers and processes pages on its own, following links according to your instructions. Good for indexing a docs site, monitoring a competitor’s blog, or building a structured dataset from an entire section of a site.

crawlInstruction tells the crawler which links to follow. transformInstruction tells the AI what to extract from each page it visits. maxPages is a safety cap — the crawl stops once it hits that number.
The same useProxy, proxyCountry, and cookies options from the scraper work here too.
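A sketch using the three crawl parameters named above (the url key and overall call shape are assumptions):

```php
$result = $spidra->crawl->run([
    'url'                  => 'https://docs.example.com',
    'crawlInstruction'     => 'Only follow links within the /guides section',
    'transformInstruction' => 'Extract the page title and a one-paragraph summary',
    'maxPages'             => 25, // safety cap: stop once 25 pages are processed
]);
```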
Just like scraping, you can fire-and-forget with submit() and poll with get():
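For example (jobId and status field names are assumptions, as above):

```php
$crawl = $spidra->crawl->submit([
    'url'              => 'https://blog.example.com',
    'crawlInstruction' => 'Follow links to individual posts',
    'maxPages'         => 10,
]);

// Check back whenever you like — no need to hold a connection open.
$crawl = $spidra->crawl->get($crawl['jobId']);
echo $crawl['status'];
```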
Downloading the raw content
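A sketch, assuming a downloads()-style method and a per-page result shape — neither is confirmed by this document, which only says signed URLs are available:

```php
// Hypothetical method and field names.
$downloads = $spidra->crawl->downloads($jobId);
foreach ($downloads['pages'] as $page) {
    // Signed links expire after an hour, so fetch them promptly.
    file_put_contents($page['slug'] . '.md', file_get_contents($page['markdownUrl']));
}
```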
Once a crawl completes, you can fetch signed URLs to download the raw HTML and Markdown for every page that was crawled. These links expire after an hour.

Re-extracting with a different prompt
If you crawled a site and want to pull out different information — say you originally extracted titles and summaries, but now you need prices — you don’t have to re-crawl. extract() runs a new AI pass over the already-crawled content and charges only transformation credits:
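As a sketch — extract() is named in the text; its signature is an assumption:

```php
// New AI pass over the already-crawled pages: no new crawl,
// only transformation credits are charged.
$result = $spidra->crawl->extract($jobId, [
    'transformInstruction' => 'Extract every price mentioned on the page',
]);
```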
Browsing your crawl history
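A sketch, assuming a list() method with pagination options and a crawls result array (none of which is confirmed by this document):

```php
// Hypothetical method and field names.
$history = $spidra->crawl->list(['limit' => 20]);
foreach ($history['crawls'] as $crawl) {
    echo $crawl['jobId'] . ' ' . $crawl['status'] . PHP_EOL;
}
```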
Logs
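A sketch of querying the log — the filter dimensions (status, URL, date range, source) come from the text below, but the method name and filter keys are assumptions:

```php
// Hypothetical method and filter key names.
$logs = $spidra->logs->list([
    'status' => 'failed',
    'from'   => '2024-01-01',
    'source' => 'api', // API requests only, excluding the playground
]);
```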
Every scrape request your API key makes gets logged automatically. You can filter by status, URL, date range, or where it came from (API vs playground).

Usage statistics
Check how many requests and credits your account has used over a given period. "7d" gives one row per day for the last week, "30d" gives the last month, and "weekly" gives one row per week for the last seven weeks.
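For example — the period values come from the text above; the method name and row field names are assumptions:

```php
$usage = $spidra->usage->get(['period' => '7d']); // one row per day, last week
foreach ($usage['rows'] as $row) {
    echo $row['date'] . ': ' . $row['credits'] . ' credits' . PHP_EOL;
}
```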
Error handling
Every API error is mapped to a typed exception, so you can catch exactly what you care about and ignore the rest:

| Exception | HTTP | Meaning |
|---|---|---|
| AuthenticationException | 401 | The API key is missing or invalid |
| InsufficientCreditsException | 403 | No credits remaining on the account |
| RateLimitException | 429 | Too many requests — back off |
| ServerException | 500 | Unexpected server-side error |
| SpidraException | any | Base class for all Spidra exceptions |
Use getCode() for the HTTP status and getMessage() for a human-readable explanation.
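A sketch of catching the specific case you care about and letting the base class handle the rest — the exception class names and accessors come from the table above, but the Spidra\Exceptions namespace is an assumption:

```php
use Spidra\Exceptions\RateLimitException; // namespace assumed
use Spidra\Exceptions\SpidraException;

try {
    $result = $spidra->scrape->run($params);
} catch (RateLimitException $e) {
    sleep(10); // 429: back off before retrying
} catch (SpidraException $e) {
    // Base class catches every other Spidra error.
    error_log($e->getCode() . ': ' . $e->getMessage());
}
```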
