Re-Extract from Existing Crawl
Run a new extraction on pages from a completed crawl job without re-crawling the site
POST
Prerequisites
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| `transformInstruction` | string | Yes | The extraction prompt to apply to every page from the source crawl. Maximum 5,000 characters. |
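
As a minimal sketch of building a valid body, the helper below checks the prompt client-side before anything is sent. The name `build_extract_body` is hypothetical; only the `transformInstruction` field and its 5,000-character limit come from the table above. It counts Python characters, which may not match exactly how the server counts.

```python
import json

MAX_INSTRUCTION_CHARS = 5000  # limit stated in the request body table


def build_extract_body(instruction: str) -> str:
    """Return the JSON request body, or raise ValueError if the prompt is invalid."""
    if not instruction:
        raise ValueError("transformInstruction is required")
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError("transformInstruction must be 5000 characters or fewer")
    return json.dumps({"transformInstruction": instruction})
```

Validating locally avoids spending a round trip on a request that would be rejected with a 400.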
How It Works
- Pass the `jobId` of a completed crawl job. If you ran the crawl previously, this is the `id` field shown in your crawl history.
- Provide a `transformInstruction` describing what you want to extract.
- Spidra loads the saved content for each page and runs your prompt against it.
- A new crawl job is created with the results, which you can poll and download the same way as any other job.
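
The steps above can be sketched as a single request. Several details here are assumptions, not confirmed by this page: the base URL is a placeholder, the path `/crawl/{jobId}/extract` is inferred from the other crawl endpoints and the `/extract` mention in the error table, and the bearer-token header is a guess at the authorization scheme. The helper name `make_extract_request` is hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, replace with the real API base URL


def make_extract_request(job_id: str, instruction: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the re-extraction request for a completed crawl."""
    body = json.dumps({"transformInstruction": instruction}).encode("utf-8")
    safe_id = urllib.parse.quote(job_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/crawl/{safe_id}/extract",  # assumed path
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


# Sending it requires network access and real credentials:
# with urllib.request.urlopen(make_extract_request("abc123", "Extract the title", "KEY")) as resp:
#     new_job_id = json.load(resp)["jobId"]
```

Building the request separately from sending it keeps the sketch easy to inspect and test without touching the network.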
When to Use This
- You want to extract different fields from pages you already crawled
- Your first extraction prompt wasn’t quite right and you want to try again
- You need the same pages in two different formats, like JSON and CSV
Polling Results
The response returns a new `jobId`. Use the standard crawl endpoints to check progress and get results:
| Endpoint | Purpose |
|---|---|
| `GET /crawl/{jobId}` | Poll job status |
| `GET /crawl/{jobId}/pages` | Get extracted data per page |
| `GET /crawl/{jobId}/download` | Download results as ZIP |
| `POST /crawl/{jobId}/retry/{pageId}` | Retry a specific page |
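
A polling loop over `GET /crawl/{jobId}` might look like the sketch below. The HTTP call is injected as `fetch_status`, a hypothetical callable you would implement yourself, and the loop assumes the response is a JSON object with a `status` field that becomes `"completed"`. This page does not list the other status values, so failed jobs are not detected here and simply run into the timeout.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll until the job reports status "completed" or the attempts run out."""
    for _ in range(max_attempts):
        job = fetch_status(job_id)  # expected to wrap GET /crawl/{jobId}
        if job.get("status") == "completed":
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} not completed after {max_attempts} polls")
```

Once the job is returned, the per-page data and ZIP download endpoints in the table can be called with the same `jobId`.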
Common Errors
| Status | Error message | Cause |
|---|---|---|
| 400 | Source crawl job has not completed successfully | You called `/extract` before the source job finished. Wait for `status: "completed"`. |
| 400 | Missing required field: transformInstruction | The request body is missing the `transformInstruction` field. |
| 400 | transformInstruction must be 5000 characters or fewer | Your prompt exceeds the 5,000-character limit. |
| 403 | You have exceeded your monthly credit limit. | Not enough credits remaining. Check your usage at `GET /usage`. |
| 404 | Source crawl job not found | The `jobId` does not exist or does not belong to your account. |
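
One way to act on these errors in client code is to map each status and message to a next step. The function name and the returned labels below are hypothetical, and the message matching relies on the exact wording in the table above, which may change.

```python
def next_step_for_error(status: int, message: str) -> str:
    """Map an error response from the extract call to a suggested action."""
    if status == 400 and "has not completed" in message:
        return "wait_for_source_job"  # poll the source job, then call again
    if status == 400:
        return "fix_request_body"  # missing or too-long transformInstruction
    if status == 403:
        return "check_credits"  # see GET /usage
    if status == 404:
        return "check_job_id"  # wrong ID or job owned by another account
    return "unknown"
```

Only the first case is worth retrying automatically; the others need a change to the request or the account before a retry can succeed.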
Authorizations
Path Parameters
`jobId`: The ID of the completed source crawl job to extract from.
Body
application/json
`transformInstruction` (string, required): Extraction prompt to apply to all pages from the source crawl. Maximum string length: 5000.
