
# Get Crawled Pages

> Retrieve all pages from a completed crawl job, including extracted data and signed URLs to the raw HTML and markdown files.

Returns every page processed by a crawl job. Call this once the job status is `completed`.

Each page record includes the extracted content in `data`, plus signed URLs to the original HTML snapshot and markdown version stored by Spidra.

## Example Request

<CodeGroup>
  ```bash cURL theme={null}
  curl https://api.spidra.io/api/crawl/abc-123/pages \
    -H "Authorization: Bearer YOUR_API_KEY"
  ```

  ```python Python theme={null}
  import requests

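  response = requests.get(
      "https://api.spidra.io/api/crawl/abc-123/pages",
      headers={"Authorization": "Bearer YOUR_API_KEY"},
      timeout=30,
  )
  response.raise_for_status()  # raise on 401/403 and other error statuses
  pages = response.json()["pages"]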
  ```

  ```javascript Node.js theme={null}
  const response = await fetch("https://api.spidra.io/api/crawl/abc-123/pages", {
    headers: { Authorization: "Bearer YOUR_API_KEY" }
  });
  const { pages } = await response.json();
  ```
</CodeGroup>

## Response Fields

| Field                   | Type           | Description                                                                                                                                                                                                |
| ----------------------- | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `pages`                 | array          | All pages processed by this job, including failed ones.                                                                                                                                                    |
| `pages[].id`            | string         | Unique page ID. Pass this to [POST /crawl/{jobId}/extract](/api-reference/crawling/crawl-extract) to re-run extraction on a specific page.                                                                 |
| `pages[].url`           | string         | The URL of this page.                                                                                                                                                                                      |
| `pages[].title`         | string         | Page title as detected during crawling. The OpenAPI example shows `null` for a failed page, so treat this field as optional.                                                                               |
| `pages[].status`        | string         | `success` or `failed`. The OpenAPI schema also lists `pending` as a possible value.                                                                                                                        |
| `pages[].data`          | any            | Extracted content for this page. When a `transformInstruction` or `schema` was provided, this contains AI-extracted structured data. When neither was set, this is the raw page markdown — no AI was used. |
| `pages[].error_message` | string or null | Error details when `status` is `failed`.                                                                                                                                                                   |
| `pages[].html`          | string or null | Signed URL to the raw HTML snapshot. Valid for 1 hour.                                                                                                                                                     |
| `pages[].markdown`      | string or null | Signed URL to the markdown version of this page. Valid for 1 hour.                                                                                                                                         |
| `pages[].created_at`    | string         | ISO 8601 timestamp when this page was processed.                                                                                                                                                           |
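
The field table above can be mirrored in a small typed record, which makes missing or `null` values explicit in client code. This is a minimal sketch, not an official client: `CrawledPage` and `parse_pages` are hypothetical names, and `title` is treated as optional because the OpenAPI example shows it as `null` for a failed page.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CrawledPage:
    """One entry of the `pages` array (hypothetical helper type)."""
    id: str
    url: str
    title: Optional[str]
    status: str
    data: Any
    error_message: Optional[str]
    html: Optional[str]       # signed URL, or None
    markdown: Optional[str]   # signed URL, or None
    created_at: str           # ISO 8601 timestamp, kept as a string here


def parse_pages(body: dict) -> list[CrawledPage]:
    """Convert the decoded JSON body into CrawledPage records."""
    return [
        CrawledPage(
            id=p["id"],
            url=p["url"],
            title=p.get("title"),
            status=p["status"],
            data=p.get("data"),
            error_message=p.get("error_message"),
            html=p.get("html"),
            markdown=p.get("markdown"),
            created_at=p["created_at"],
        )
        for p in body.get("pages", [])
    ]


# Trimmed-down sample shaped like the documented response.
sample_body = {
    "pages": [
        {"id": "page-uuid-1", "url": "https://example.com/a", "title": "A",
         "status": "success", "data": {"title": "A"}, "error_message": None,
         "html": "https://storage.spidra.io/signed/...",
         "markdown": "https://storage.spidra.io/signed/...",
         "created_at": "2025-12-17T15:02:10Z"},
        {"id": "page-uuid-2", "url": "https://example.com/b", "title": None,
         "status": "failed", "data": None, "error_message": "AI transformation failed",
         "html": None, "markdown": None, "created_at": "2025-12-17T15:02:55Z"},
    ]
}

parsed_pages = parse_pages(sample_body)
for page in parsed_pages:
    print(page.id, page.status, page.html is not None)
```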

## Example Response

```json theme={null}
{
  "pages": [
    {
      "id": "page-uuid-1",
      "url": "https://example.com/blog/how-to-scrape",
      "title": "How to Scrape the Web Without Getting Blocked",
      "status": "success",
      "data": {
        "title": "How to Scrape the Web Without Getting Blocked",
        "author": "Jane Smith",
        "published": "2025-11-20",
        "summary": "A guide to rotating proxies and handling JavaScript-heavy pages."
      },
      "error_message": null,
      "html": "https://storage.spidra.io/signed/...",
      "markdown": "https://storage.spidra.io/signed/...",
      "created_at": "2025-12-17T15:02:10Z"
    },
    {
      "id": "page-uuid-2",
      "url": "https://example.com/blog/javascript-rendering",
      "title": "JavaScript Rendering Explained",
      "status": "failed",
      "data": null,
      "error_message": "AI transformation failed: content too short to extract",
      "html": null,
      "markdown": null,
      "created_at": "2025-12-17T15:02:55Z"
    }
  ]
}
```
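
As a rough illustration of reading a response like the one above, the sketch below counts pages per status and parses the `created_at` timestamps. The helper names `parse_timestamp` and `summarize` are made up for this example, and the sample body is a trimmed copy of the example response.

```python
from collections import Counter
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2025-12-17T15:02:10Z'."""
    # datetime.fromisoformat only accepts a trailing 'Z' from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(body: dict) -> dict:
    """Count pages per status and find the first and last processing times."""
    pages = body.get("pages", [])
    times = [parse_timestamp(p["created_at"]) for p in pages]
    return {
        "total": len(pages),
        "by_status": dict(Counter(p["status"] for p in pages)),
        "first": min(times) if times else None,
        "last": max(times) if times else None,
    }


example_body = {
    "pages": [
        {"id": "page-uuid-1", "status": "success", "created_at": "2025-12-17T15:02:10Z"},
        {"id": "page-uuid-2", "status": "failed", "created_at": "2025-12-17T15:02:55Z"},
    ]
}

summary = summarize(example_body)
print(summary["total"], summary["by_status"])
```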

## The `data` Field

What `data` contains depends on how you configured the job:

* **With `transformInstruction`:** `data` is whatever the AI extracted from your prompt. Depending on what you asked for, it can be a plain string or a structured JSON value such as an object.
* **With `schema`:** `data` is a JSON object matching the schema you defined, with all fields present.
* **With neither:** `data` is the raw page markdown. No AI was involved and no token credits were charged. The `html` and `markdown` URLs give the same page as stored files: the HTML snapshot and the markdown version.
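
Because `data` can arrive as a string, an object, or `null`, client code usually has to branch on its decoded type. The following is one possible sketch; `describe_data` and `data_as_text` are hypothetical helpers. Note that the type alone cannot tell raw markdown apart from a plain-text AI answer, so the caller still needs to know how the job was configured.

```python
import json
from typing import Any


def describe_data(data: Any) -> str:
    """Classify a page's `data` value by its Python type after JSON decoding."""
    if data is None:
        return "empty"        # e.g. a failed page
    if isinstance(data, str):
        return "text"         # raw markdown, or a plain-text AI answer
    if isinstance(data, (dict, list)):
        return "structured"   # JSON object or array from a schema or prompt
    return "scalar"           # numbers or booleans, should a prompt yield them


def data_as_text(data: Any) -> str:
    """Render any `data` value as text, for logging or storage."""
    if data is None:
        return ""
    if isinstance(data, str):
        return data
    return json.dumps(data, ensure_ascii=False, indent=2)


print(describe_data({"author": "Jane Smith"}))
print(describe_data("# Heading\n\nBody text"))
print(describe_data(None))
```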

## Handling Failed Pages

Pages with `status: "failed"` still appear in the response. The `error_message` explains what went wrong. To re-run extraction on a failed page without re-crawling the site, use the [extract endpoint](/api-reference/crawling/crawl-extract).
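
A small sketch of picking out the failed pages so their IDs can be handed to the extract endpoint. The function name `failed_pages` is hypothetical, and the request body for the extract call is deliberately left out because it is not described on this page.

```python
def failed_pages(body: dict) -> list[dict]:
    """Return id, url and error message for every page whose status is 'failed'."""
    return [
        {"id": p["id"], "url": p["url"], "error": p.get("error_message")}
        for p in body.get("pages", [])
        if p.get("status") == "failed"
    ]


crawl_body = {
    "pages": [
        {"id": "page-uuid-1", "url": "https://example.com/blog/how-to-scrape",
         "status": "success", "error_message": None},
        {"id": "page-uuid-2", "url": "https://example.com/blog/javascript-rendering",
         "status": "failed",
         "error_message": "AI transformation failed: content too short to extract"},
    ]
}

retry_candidates = failed_pages(crawl_body)
for item in retry_candidates:
    print(f"{item['id']}: {item['error']}")
```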

<Tip>
  The `html` and `markdown` URLs expire after one hour. If you need permanent access to the raw files, use [Download Crawl Results](/api-reference/crawling/crawl-download) to get a full ZIP archive.
</Tip>
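
Since the signed URLs stop working after an hour, a client that holds on to a response may want to track when it was fetched and request the pages again before following a stale link. The sketch below assumes the one-hour window starts roughly when the response was received, which this page does not state precisely. The five-minute safety margin and the name `links_still_fresh` are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SIGNED_URL_TTL = timedelta(hours=1)    # stated in the docs: valid for 1 hour
SAFETY_MARGIN = timedelta(minutes=5)   # assumption: refresh slightly early


def links_still_fresh(fetched_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True while signed URLs fetched at `fetched_at` should still work."""
    now = now or datetime.now(timezone.utc)
    return now - fetched_at < SIGNED_URL_TTL - SAFETY_MARGIN


fetched_at = datetime(2025, 12, 17, 15, 0, tzinfo=timezone.utc)
print(links_still_fresh(fetched_at, fetched_at + timedelta(minutes=30)))  # True
print(links_still_fresh(fetched_at, fetched_at + timedelta(minutes=56)))  # False
```

When the check returns `False`, calling this endpoint again should return fresh signed URLs. For long-term storage, the ZIP download mentioned in the tip is the better fit.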


## OpenAPI

````yaml GET /crawl/{jobId}/pages
openapi: 3.1.0
info:
  title: Spidra API
  version: 1.0.0
  description: >-
    Public API endpoints for web scraping via Spidra. Authenticate with
    `Authorization: Bearer YOUR_API_KEY`.
servers:
  - url: https://api.spidra.io/api
security:
  - BearerAuth: []
  - ApiKeyAuth: []
paths:
  /crawl/{jobId}/pages:
    get:
      tags:
        - Crawling
      summary: Get Crawled Pages
      parameters:
        - name: jobId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: List of crawled pages with extracted data
          content:
            application/json:
              schema:
                type: object
                properties:
                  pages:
                    type: array
                    items:
                      type: object
                      properties:
                        id:
                          type: string
                        url:
                          type: string
                        title:
                          type: string
                        status:
                          type: string
                          enum:
                            - success
                            - failed
                            - pending
                        data:
                          description: >-
                            Extracted content for this page. When a
                            transformInstruction or schema was provided, this
                            contains AI-extracted structured data. When neither
                            was set, this contains the raw page markdown — no AI
                            was used.
                        error_message:
                          type:
                            - string
                            - 'null'
                        html:
                          type:
                            - string
                            - 'null'
                          description: >-
                            Signed URL to the raw HTML snapshot of this page.
                            Valid for 1 hour. Use the download endpoint for
                            permanent access.
                        markdown:
                          type:
                            - string
                            - 'null'
                          description: >-
                            Signed URL to the markdown version of this page.
                            Valid for 1 hour.
                        created_at:
                          type: string
                          format: date-time
              example:
                pages:
                  - id: page-uuid-1
                    url: https://example.com/blog/post-1
                    title: First Post
                    status: success
                    data:
                      title: First Post
                      author: John
                      date: '2025-01-01'
                    error_message: null
                    html: >-
                      https://storage.spidra.io/signed/crawl/abc-123/page1.html?...
                    markdown: >-
                      https://storage.spidra.io/signed/crawl/abc-123/page1.md?...
                    created_at: '2025-12-17T15:00:00Z'
                  - id: page-uuid-2
                    url: https://example.com/blog/broken-page
                    title: null
                    status: failed
                    data: null
                    error_message: 'AI transformation failed: content too short to extract'
                    html: null
                    markdown: null
                    created_at: '2025-12-17T15:01:30Z'
        '401':
          description: Invalid or missing API key
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                status: error
                message: Access token invalid or expired
        '403':
          description: Not authorized to access this job
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                status: error
                message: Unauthorized access or job not found
components:
  schemas:
    ErrorResponse:
      type: object
      properties:
        status:
          type: string
          enum:
            - error
        message:
          type: string
      required:
        - status
        - message
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: API key
    ApiKeyAuth:
      type: apiKey
      in: header
      name: x-api-key

````
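
The spec above declares two security schemes (a bearer token and an `x-api-key` header) and two error responses that share the `ErrorResponse` shape. The sketch below shows one way a client might build headers for either scheme and turn a 401 or 403 body into a readable message. The function names and the `"bearer"` / `"api-key"` labels are invented for this example and are not part of the API.

```python
def auth_headers(api_key: str, scheme: str = "bearer") -> dict:
    """Build request headers for one of the two security schemes in the spec."""
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    if scheme == "api-key":
        return {"x-api-key": api_key}
    raise ValueError(f"unknown scheme: {scheme!r}")


def explain_error(status_code: int, body: dict) -> str:
    """Turn a 401 or 403 ErrorResponse body into a readable message."""
    hints = {
        401: "invalid or missing API key",
        403: "not authorized to access this job",
    }
    hint = hints.get(status_code, "unexpected error")
    return f"HTTP {status_code} ({hint}): {body.get('message', 'no message')}"


print(auth_headers("YOUR_API_KEY", scheme="api-key"))
print(explain_error(401, {"status": "error", "message": "Access token invalid or expired"}))
```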