Get Crawled Pages

GET /crawl/{jobId}/pages
Once a crawl job completes, this endpoint returns every page that was processed, along with the data the AI extracted. Each page record includes the original URL, the extraction status, and a data field whose shape matches your transformInstruction. The response also includes time-limited signed URLs for the raw HTML and markdown files stored in Spidra's object storage; these URLs are valid for one hour.

Example Request

curl https://api.spidra.io/api/crawl/abc-123/pages \
  -H "x-api-key: YOUR_API_KEY"
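The same request in Python, as a minimal sketch using only the standard library (the `build_pages_request` and `get_crawled_pages` helpers are illustrative, not part of any official SDK; the endpoint and header name are taken from the curl example above):

```python
import json
import urllib.request

API_BASE = "https://api.spidra.io/api"

def build_pages_request(job_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET /crawl/{jobId}/pages request with the x-api-key header."""
    url = f"{API_BASE}/crawl/{job_id}/pages"
    return urllib.request.Request(url, headers={"x-api-key": api_key})

def get_crawled_pages(job_id: str, api_key: str) -> dict:
    """Fetch the endpoint and decode the JSON body: {"pages": [...]}."""
    req = build_pages_request(job_id, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: `pages = get_crawled_pages("abc-123", "YOUR_API_KEY")["pages"]`.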

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| pages | array | List of all pages processed by this job |
| pages[].id | string | Unique page ID. Pass it to the retry endpoint to re-run a failed page |
| pages[].url | string | The URL of this specific page |
| pages[].title | string | Page title as detected during crawling |
| pages[].status | string | `success`, `failed`, or `pending` |
| pages[].data | object or string | The AI-extracted data for this page; the shape matches your transformInstruction |
| pages[].error_message | string or null | Error details if status is `failed` |
| pages[].html_url | string or null | Signed URL to the raw HTML file (valid for 1 hour) |
| pages[].markdown_url | string or null | Signed URL to the markdown version of the page (valid for 1 hour) |
| pages[].created_at | string | ISO 8601 timestamp of when this page was processed |
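A lightweight typed wrapper can make the fields above easier to work with in application code; a sketch (the `PageRecord` class is illustrative, not part of any SDK):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PageRecord:
    id: str
    url: str
    title: Optional[str]
    status: str                   # "success", "failed", or "pending"
    data: Any                     # shape follows your transformInstruction; None on failure
    error_message: Optional[str]  # set when status is "failed"
    html_url: Optional[str]       # signed URL, valid for 1 hour
    markdown_url: Optional[str]   # signed URL, valid for 1 hour
    created_at: str               # ISO 8601 timestamp

    @classmethod
    def from_dict(cls, raw: dict) -> "PageRecord":
        # Pick only the known fields, tolerating extras in the API response.
        return cls(**{f: raw.get(f) for f in cls.__dataclass_fields__})
```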

Example Response

{
  "pages": [
    {
      "id": "page-uuid-1",
      "url": "https://example.com/blog/how-to-scrape",
      "title": "How to Scrape the Web Without Getting Blocked",
      "status": "success",
      "data": {
        "title": "How to Scrape the Web Without Getting Blocked",
        "author": "Jane Smith",
        "published": "2025-11-20",
        "summary": "A guide to rotating proxies and handling JavaScript-heavy pages."
      },
      "error_message": null,
      "html_url": "https://storage.spidra.io/signed/...",
      "markdown_url": "https://storage.spidra.io/signed/...",
      "created_at": "2025-12-17T15:02:10Z"
    },
    {
      "id": "page-uuid-2",
      "url": "https://example.com/blog/javascript-rendering",
      "title": "JavaScript Rendering Explained",
      "status": "failed",
      "data": null,
      "error_message": "AI transformation failed: content too short to extract",
      "html_url": null,
      "markdown_url": null,
      "created_at": "2025-12-17T15:02:55Z"
    }
  ]
}

Handling Failed Pages

Pages with status: "failed" still appear in the response, so you get a full picture of what was and was not processed. The error_message field explains what went wrong. To re-run extraction on a specific failed page, pass its id to the retry endpoint available on your account.
The signed URLs for html_url and markdown_url expire after one hour. If you need permanent access to the raw content, download it promptly or use Download Crawl Results to get a full ZIP archive.
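One way to act on this, sketched in Python (field names come from the response above; the `partition_pages` and `failed_page_ids` helpers are illustrative, and the retry call itself is omitted since it goes through your account's retry endpoint):

```python
def partition_pages(pages: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split page records into (succeeded, failed) by their status field."""
    succeeded = [p for p in pages if p["status"] == "success"]
    failed = [p for p in pages if p["status"] == "failed"]
    return succeeded, failed

def failed_page_ids(pages: list[dict]) -> list[str]:
    """Collect the page ids to pass to the retry endpoint."""
    return [p["id"] for p in pages if p["status"] == "failed"]
```

Pages still marked `pending` fall into neither bucket, so check for them separately if the job may not have fully settled.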

Authorizations

x-api-key (string, header, required)

Path Parameters

jobId (string, required)

Response

List of crawled pages with extracted data

pages (object[])