Get Crawled Pages
Retrieve all pages from a crawl job, including the AI-extracted data and temporary signed URLs to the original HTML and markdown content.
GET
Once a crawl job completes, this endpoint returns every page that was processed, along with the data extracted by the AI. Each page record includes the original URL, extraction status, and the structured data field containing whatever your transformInstruction asked for.
The response also includes time-limited signed URLs pointing to the raw HTML and markdown files stored in Spidra’s object storage. These URLs are valid for one hour.
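Because the signed URLs expire, it can help to record when the response was received and check that timestamp before downloading. The following is a minimal sketch, not an official client: the helper names `signed_urls_expired` and `download_markdown` are hypothetical, and it assumes the one-hour window starts when the response is received and that the markdown file is UTF-8 encoded.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.request import urlopen

# Assumption: the one-hour validity window starts when the response is received.
SIGNED_URL_TTL = timedelta(hours=1)


def signed_urls_expired(fetched_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if URLs from a response received at `fetched_at` should be treated as expired."""
    now = now or datetime.now(timezone.utc)
    return now - fetched_at >= SIGNED_URL_TTL


def download_markdown(page: dict, fetched_at: datetime) -> Optional[str]:
    """Fetch the markdown for one page record, or return None if it is unavailable."""
    url = page.get("markdown_url")  # documented as string or null
    if url is None or signed_urls_expired(fetched_at):
        # No file, or the URL is likely stale: request the pages again for fresh URLs.
        return None
    with urlopen(url, timeout=30) as resp:
        # Assumption: the stored markdown is UTF-8 encoded.
        return resp.read().decode("utf-8")
```

If a download is skipped because the URLs look stale, calling this endpoint again should return fresh signed URLs.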
Example Request
Response Fields
| Field | Type | Description |
|---|---|---|
| pages | array | List of all pages processed by this job |
| pages[].id | string | Unique page ID. Use this when calling POST /crawl//retry/ |
| pages[].url | string | The URL of this specific page |
| pages[].title | string | Page title as detected during crawling |
| pages[].status | string | success, failed, or pending |
| pages[].data | object or string | The AI-extracted data for this page. The shape matches your transformInstruction |
| pages[].error_message | string or null | Error details if status is failed |
| pages[].html_url | string or null | Signed URL to the raw HTML file (valid for 1 hour) |
| pages[].markdown_url | string or null | Signed URL to the markdown version of the page (valid for 1 hour) |
| pages[].created_at | string | ISO 8601 timestamp when this page was processed |
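The fields above could be modeled on the client side roughly as follows. This is a sketch under stated assumptions: the `CrawledPage` class and `parse_pages` function are hypothetical names, and the code assumes the JSON body has already been decoded into a Python dict with a top-level `pages` array, as the table suggests.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class CrawledPage:
    """One entry of the `pages` array, mirroring the documented fields."""
    id: str
    url: str
    title: str
    status: str                   # "success", "failed", or "pending"
    data: Any                     # object or string, shaped by transformInstruction
    error_message: Optional[str]  # set when status is "failed"
    html_url: Optional[str]       # signed URL, valid for 1 hour
    markdown_url: Optional[str]   # signed URL, valid for 1 hour
    created_at: str               # ISO 8601 timestamp


def parse_pages(body: dict) -> List[CrawledPage]:
    """Convert a decoded response body into CrawledPage records."""
    return [
        CrawledPage(
            id=p["id"],
            url=p["url"],
            title=p.get("title", ""),
            status=p["status"],
            data=p.get("data"),
            error_message=p.get("error_message"),
            html_url=p.get("html_url"),
            markdown_url=p.get("markdown_url"),
            created_at=p.get("created_at", ""),
        )
        for p in body.get("pages", [])
    ]
```

Nullable fields are read with `.get()` so that a record with a missing `html_url` or `error_message` parses to `None` instead of raising.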
Example Response
Handling Failed Pages
Pages with status: "failed" still appear in the response so you have a full picture of what was and was not processed. Use the error_message field to understand what went wrong. To re-run extraction on a specific failed page, use the retry endpoint available on your account.
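As an illustration of working with failed pages, the sketch below groups page records by status and collects the ID, URL, and error message of each failure, which is the information you would need before calling the retry endpoint. The function names `group_by_status` and `failed_page_report` are hypothetical, and the code assumes each page is a plain dict with the fields documented above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_by_status(pages: List[dict]) -> Dict[str, List[dict]]:
    """Bucket page records by their `status` value."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for page in pages:
        groups[page["status"]].append(page)
    return dict(groups)


def failed_page_report(pages: List[dict]) -> List[Tuple[str, str, Optional[str]]]:
    """Return (id, url, error_message) for every page whose status is "failed"."""
    return [
        (page["id"], page["url"], page.get("error_message"))
        for page in pages
        if page["status"] == "failed"
    ]
```

The page IDs from the report are the values to pass when retrying individual pages.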

