Re-Extract from Existing Crawl

curl --request POST \
  --url https://api.spidra.io/api/crawl/{jobId}/extract \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "transformInstruction": "Extract only the product price and availability status"
}
'

{
  "status": "queued",
  "jobId": "new-job-uuid",
  "message": "Extraction job queued. Poll /api/crawl/new-job-uuid for results."
}

POST /crawl/{jobId}/extract

This endpoint does not re-crawl the website. Spidra reads the HTML and markdown already saved from the original crawl.

Prerequisites

The source crawl job must have a completed status before you call this endpoint. Calling /extract on a job that is still running, pending, or failed returns a 400 Bad Request. Poll GET /crawl/{jobId} and wait for "status": "completed" before proceeding.
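
The wait described above can be sketched as a small polling loop. This is a minimal sketch, not official client code: `wait_until_completed` and `fetch_status` are hypothetical names, and `fetch_status` stands in for whatever call you use to perform GET /crawl/{jobId} and read its status field. The poll interval and timeout are arbitrary values chosen for illustration.

```python
import time
from typing import Callable


def wait_until_completed(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until the source crawl reports "completed".

    fetch_status is a hypothetical callable that performs
    GET /crawl/{jobId} and returns the job's "status" string.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status == "completed":
            return
        if status == "failed":
            # A failed source job never becomes extractable.
            raise RuntimeError("source crawl failed; /extract would return 400")
        if waited >= timeout_s:
            raise TimeoutError(f"crawl still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.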

Request Body

Field	Type	Required	Description
`transformInstruction`	string	Yes	The extraction prompt to apply to every page from the source crawl. Maximum 5,000 characters.
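
A client can check these constraints locally before sending anything. The sketch below assumes the limit is counted in characters as Python's `len` counts them; how the server counts (for example, bytes versus code points) is not stated here. `build_extract_body` is a hypothetical helper name.

```python
import json

MAX_INSTRUCTION_CHARS = 5000  # limit stated in the Request Body table


def build_extract_body(transform_instruction: str) -> str:
    """Return the JSON request body, or raise ValueError before any network call."""
    if not transform_instruction:
        raise ValueError("transformInstruction is required")
    if len(transform_instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError(
            f"transformInstruction is {len(transform_instruction)} characters; "
            f"the limit is {MAX_INSTRUCTION_CHARS}"
        )
    return json.dumps({"transformInstruction": transform_instruction})
```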

How It Works

1. Pass the jobId of a completed crawl job. If you ran the crawl previously, this is the id field shown in your crawl history.
2. Provide a transformInstruction describing what you want to extract.
3. Spidra loads the saved content for each page and runs your prompt against it.
4. A new crawl job is created with the results, which you can poll and download the same way as any other job.
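
The steps above can be sketched with the Python standard library. The base URL, headers, and body mirror the curl example on this page; `build_extract_request` and `submit_extract` are hypothetical helper names, and the sketch assumes the response is the JSON object shown in the example, with a `jobId` field.

```python
import json
import urllib.request

BASE_URL = "https://api.spidra.io/api"  # taken from the curl example


def build_extract_request(
    job_id: str, instruction: str, api_key: str
) -> urllib.request.Request:
    """Build the POST request for /crawl/{jobId}/extract without sending it."""
    body = json.dumps({"transformInstruction": instruction}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/crawl/{job_id}/extract",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


def submit_extract(job_id: str, instruction: str, api_key: str) -> str:
    """Send the request and return the new job's ID (assumed response shape)."""
    request = build_extract_request(job_id, instruction, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["jobId"]
```

The returned ID belongs to the new extraction job, not the source crawl, so it is the one to poll afterwards.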

When to Use This

You want to extract different fields from pages you already crawled
Your first extraction prompt wasn’t quite right and you want to try again
You need the same pages in two different formats, like JSON and CSV

Polling Results

The response returns a new jobId. Use the standard crawl endpoints to check progress and get results:

Endpoint	Purpose
`GET /crawl/{jobId}`	Poll job status
`GET /crawl/{jobId}/pages`	Get extracted data per page
`GET /crawl/{jobId}/download`	Download results as ZIP
`POST /crawl/{jobId}/retry/{pageId}`	Retry a specific page
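
As a convenience, the follow-up URLs for the new job can be derived from its ID. This sketch assumes the paths in the table are relative to the `https://api.spidra.io/api` base used in the curl example; `followup_urls` and `retry_url` are hypothetical helper names.

```python
BASE_URL = "https://api.spidra.io/api"  # assumed base, as in the curl example


def followup_urls(new_job_id: str) -> dict:
    """Map each GET endpoint from the table to a full URL for the new job."""
    job = f"{BASE_URL}/crawl/{new_job_id}"
    return {
        "status": job,                  # GET: poll job status
        "pages": f"{job}/pages",        # GET: extracted data per page
        "download": f"{job}/download",  # GET: results as ZIP
    }


def retry_url(new_job_id: str, page_id: str) -> str:
    """URL for POST /crawl/{jobId}/retry/{pageId}."""
    return f"{BASE_URL}/crawl/{new_job_id}/retry/{page_id}"
```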

Common Errors

Status	Error message	Cause
`400`	`Source crawl job has not completed successfully`	You called `/extract` before the source job finished. Wait for `status: "completed"`.
`400`	`Missing required field: transformInstruction`	The request body is missing the `transformInstruction` field.
`400`	`transformInstruction must be 5000 characters or fewer`	Your prompt exceeds the 5,000 character limit.
`403`	`You have exceeded your monthly credit limit.`	Not enough credits remaining. Check your usage at `GET /usage`.
`404`	`Source crawl job not found`	The `jobId` does not exist or does not belong to your account.
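
One way to act on these errors in client code is to map each row of the table to a next step. This is only a sketch: the shape of the error response body is not shown on this page, so the function takes the status code and message as plain arguments, and the returned labels are hypothetical names chosen for illustration.

```python
NOT_COMPLETED = "Source crawl job has not completed successfully"


def classify_error(status: int, message: str) -> str:
    """Translate a documented error into a suggested next step."""
    if status == 400:
        if message == NOT_COMPLETED:
            # Poll the source job until "status": "completed", then call again.
            return "source_not_completed"
        # Missing or over-long transformInstruction: fix the request body.
        return "fix_request"
    if status == 403:
        return "check_usage"    # see GET /usage
    if status == 404:
        return "check_job_id"   # unknown jobId, or not on this account
    return "unknown"
```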

Authorizations

Name	Type	In	Required
`x-api-key`	string	header	Yes

Path Parameters

Name	Type	Required	Description
`jobId`	string	Yes	The ID of the completed source crawl job to extract from.

Body

Content type: application/json

Field	Type	Required	Description
`transformInstruction`	string	Yes	Extraction prompt to apply to all pages from the source crawl. Maximum 5,000 characters.

Response

Extraction job queued

Field	Type	Description
`status`	string	Example: `"queued"`
`jobId`	string	New job ID to poll for results
`message`	string	
