
# Extract from Crawl

> Run a new extraction on pages from a completed crawl job without re-crawling the site

<Tip>
  This does not re-crawl the website. Spidra reads the HTML and markdown already saved from the original crawl.
</Tip>

## Prerequisites

<Warning>
  The source crawl job **must have a `completed` status** before you call this endpoint. Calling `POST /crawl/{jobId}/extract` on a job that is still running, pending, or failed returns `400 Bad Request`.

  Poll `GET /crawl/{jobId}` and wait for `"status": "completed"` before proceeding.
</Warning>
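As a rough illustration of the polling step, the sketch below waits until a job reports `completed`. The helper name `wait_for_completed` and the injected `fetch_status` callable are hypothetical; in practice `fetch_status` would call `GET /crawl/{jobId}` and return the `status` field. The literal status string `"failed"` is an assumption based on the wording above, not a confirmed value.

```python
import time
from typing import Callable


def wait_for_completed(
    job_id: str,
    fetch_status: Callable[[str], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until fetch_status(job_id) returns "completed".

    fetch_status is a hypothetical callable supplied by the caller; it is
    expected to wrap GET /crawl/{jobId} and return the job's status string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status == "completed":
            return
        # Assumed status value: a failed job will never become extractable.
        if status == "failed":
            raise RuntimeError(f"crawl job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl job {job_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.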

## Request Body

| Field                  | Type   | Required | Description                                                                                   |
| ---------------------- | ------ | -------- | --------------------------------------------------------------------------------------------- |
| `transformInstruction` | string | Yes      | The extraction prompt to apply to every page from the source crawl. Maximum 5,000 characters. |
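
A minimal sketch of building this body on the client side, with the two documented constraints checked before the request is sent. The function name `build_extract_body` is hypothetical, and the length check uses Python's `len()`, which counts Unicode code points; whether the server counts characters the same way is an assumption.

```python
import json

MAX_INSTRUCTION_CHARS = 5000  # limit stated in the field description


def build_extract_body(transform_instruction: str) -> bytes:
    """Return the JSON request body, rejecting inputs the API would refuse."""
    if not transform_instruction:
        raise ValueError("transformInstruction is required")
    if len(transform_instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError(
            f"transformInstruction must be {MAX_INSTRUCTION_CHARS} characters or fewer"
        )
    return json.dumps({"transformInstruction": transform_instruction}).encode("utf-8")
```

Validating locally avoids spending a round trip on a request that would be rejected anyway.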

## How It Works

1. Pass the `jobId` of a **completed** crawl job as the path parameter in `POST /crawl/{jobId}/extract`. If you ran the crawl previously, this is the `id` field shown in your crawl history.
2. Provide a `transformInstruction` describing what you want to extract.
3. Spidra loads the saved content for each page and runs your prompt against it.
4. Spidra creates a new crawl job to hold the results. You can poll and download that job the same way as any other crawl job.
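
The steps above can be sketched with Python's standard library. The base URL and the `Authorization: Bearer` header come from the OpenAPI definition on this page; the function names `build_extract_request` and `start_extraction` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.spidra.io/api"


def build_extract_request(job_id: str, instruction: str, api_key: str) -> urllib.request.Request:
    """Prepare POST /crawl/{jobId}/extract without sending it."""
    body = json.dumps({"transformInstruction": instruction}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/crawl/{quote(job_id, safe='')}/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def start_extraction(job_id: str, instruction: str, api_key: str) -> str:
    """Send the request and return the jobId of the new extraction job."""
    req = build_extract_request(job_id, instruction, api_key)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return payload["jobId"]
```

A call such as `start_extraction(source_job_id, "Extract only the product price and availability status", api_key)` would return the ID to poll next.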

## When to Use This

* You want to extract different fields from pages you already crawled
* Your first extraction prompt wasn't quite right and you want to try again
* You need the same pages in two different formats, like JSON and CSV

## Polling Results

A successful request returns `202` with a new `jobId` for the extraction job. Use the standard crawl endpoints with that ID to check progress and get results:

| Endpoint                             | Purpose                     |
| ------------------------------------ | --------------------------- |
| `GET /crawl/{jobId}`                 | Poll job status             |
| `GET /crawl/{jobId}/pages`           | Get extracted data per page |
| `GET /crawl/{jobId}/download`        | Download results as ZIP     |
| `POST /crawl/{jobId}/retry/{pageId}` | Retry a specific page       |
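
To make the hand-off concrete, the sketch below reads the `jobId` from an extraction response and derives the follow-up URLs from the table above. The helper name `follow_up_urls` is hypothetical; the base URL is taken from the OpenAPI definition on this page.

```python
import json

BASE_URL = "https://api.spidra.io/api"


def follow_up_urls(extract_response: str) -> dict:
    """Map an extraction response body to the URLs used to track the new job."""
    job_id = json.loads(extract_response)["jobId"]
    root = f"{BASE_URL}/crawl/{job_id}"
    return {
        "status": root,                   # GET: poll job status
        "pages": f"{root}/pages",         # GET: extracted data per page
        "download": f"{root}/download",   # GET: results as ZIP
    }
```

The retry endpoint is left out because it also needs a `pageId`, which only becomes known once the per-page results are available.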

## Common Errors

| Status | Error message                                           | Cause                                                                                 |
| ------ | ------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `400`  | `Source crawl job has not completed successfully`       | You called `/extract` before the source job finished. Wait for `status: "completed"`. |
| `422`  | `Missing required field: transformInstruction`          | The request body is missing the `transformInstruction` field.                         |
| `422`  | `transformInstruction must be 5000 characters or fewer` | Your prompt exceeds the 5,000 character limit.                                        |
| `403`  | `You have exceeded your monthly credit limit.`          | Not enough credits remaining. Check your usage at `GET /usage`.                       |
| `404`  | `Source crawl job not found`                            | The `jobId` does not exist or does not belong to your account.                        |
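
One way to turn these responses into actionable errors is sketched below. It assumes error bodies follow the `ErrorResponse` shape (`status` and `message`) from the OpenAPI definition on this page. The names `SpidraError`, `HINTS`, and `parse_error` are hypothetical. Because the table lists `422` for validation failures while the OpenAPI definition groups them under `400`, the hint for `400` covers both cases.

```python
import json

# Hypothetical hints keyed by HTTP status, paraphrasing the table above.
HINTS = {
    400: "Wait for the source job to reach 'completed' and check the request body.",
    403: "Check remaining credits via GET /usage.",
    404: "Check the jobId and that it belongs to your account.",
    422: "Fix transformInstruction (required, at most 5,000 characters).",
}


class SpidraError(Exception):
    """Error raised for a non-2xx response (illustrative, not an official class)."""

    def __init__(self, http_status: int, message: str):
        super().__init__(f"{http_status}: {message}")
        self.http_status = http_status
        self.message = message
        self.hint = HINTS.get(http_status, "No specific guidance for this status.")


def parse_error(http_status: int, body: bytes) -> SpidraError:
    """Build a SpidraError from a response body, tolerating non-JSON bodies."""
    text = body.decode("utf-8", errors="replace")
    try:
        payload = json.loads(text)
    except ValueError:
        payload = None
    message = payload.get("message", text) if isinstance(payload, dict) else text
    return SpidraError(http_status, message)
```

With `urllib`, the status and body would come from the `code` attribute and `read()` method of the raised `urllib.error.HTTPError`.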


## OpenAPI

````yaml POST /crawl/{jobId}/extract
openapi: 3.1.0
info:
  title: Spidra API
  version: 1.0.0
  description: >-
    Public API endpoints for web scraping via Spidra. Authenticate with
    `Authorization: Bearer YOUR_API_KEY`.
servers:
  - url: https://api.spidra.io/api
security:
  - BearerAuth: []
  - ApiKeyAuth: []
paths:
  /crawl/{jobId}/extract:
    post:
      tags:
        - Crawling
      summary: Re-Extract from Existing Crawl
      description: >-
        Run a new AI extraction on pages already crawled by a previous completed
        job. The original HTML and markdown are reused from storage — no
        re-crawling occurs. Only AI token credits are charged.
      parameters:
        - name: jobId
          in: path
          required: true
          description: The ID of the completed source crawl job to extract from
          schema:
            type: string
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required:
                - transformInstruction
              properties:
                transformInstruction:
                  type: string
                  description: >-
                    Extraction prompt to apply to all pages from the source
                    crawl. Maximum 5,000 characters.
                  maxLength: 5000
            example:
              transformInstruction: Extract only the product price and availability status
      responses:
        '202':
          description: Extraction job queued
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    example: queued
                  jobId:
                    type: string
                    description: New job ID to poll for results
                  message:
                    type: string
              example:
                status: queued
                jobId: 661e8400-e29b-41d4-a716-446655441111
                message: >-
                  Extraction job queued. Poll
                  /api/crawl/661e8400-e29b-41d4-a716-446655441111 for results.
        '400':
          description: Bad request — see error message for details
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              examples:
                jobNotCompleted:
                  summary: Source job not completed
                  value:
                    status: error
                    message: Source crawl job has not completed successfully
                missingInstruction:
                  summary: Missing transformInstruction
                  value:
                    status: error
                    message: 'Missing required field: transformInstruction'
                instructionTooLong:
                  summary: transformInstruction exceeds 5000 characters
                  value:
                    status: error
                    message: transformInstruction must be 5000 characters or fewer
        '403':
          description: Credit limit exceeded or unauthorized
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                status: error
                message: Insufficient credits to run extraction
        '404':
          description: Source crawl job not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                status: error
                message: Crawl job not found
components:
  schemas:
    ErrorResponse:
      type: object
      properties:
        status:
          type: string
          enum:
            - error
        message:
          type: string
      required:
        - status
        - message
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: API key
    ApiKeyAuth:
      type: apiKey
      in: header
      name: x-api-key

````