> ## Documentation Index
> Fetch the complete documentation index at: https://docs.spidra.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Submit a Crawl Job

> Start a crawl job that discovers and processes multiple pages from a website. Control which pages to visit, how deep to go, and what to extract from each one.

## How It Works

Crawl jobs run asynchronously. You get a `jobId` immediately and poll `GET /crawl/{jobId}` until the job finishes.

1. **Submit** — Send your request, receive a `jobId`
2. **Discover** — Spidra loads your base URL and finds links matching your `crawlInstruction`
3. **Crawl** — Visits each discovered page (up to `maxPages`)
4. **Solve** — Handles any CAPTCHAs automatically
5. **Extract** — Runs your `transformInstruction` or `schema` on each page. If neither is provided, returns raw page markdown with no AI step.
6. **Poll** — Check `GET /crawl/{jobId}` until `status` is `completed`

***

## Request Fields

### Required

| Field              | Type   | Description                                                                                                                                                                        |
| ------------------ | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `baseUrl`          | string | The starting URL. Spidra begins here and follows links outward.                                                                                                                    |
| `crawlInstruction` | string | Which pages to follow, in plain language. e.g. `"Crawl all blog post pages"` or `"Visit product pages only"`. Spidra uses this to decide which links to visit and which to ignore. |

### Extraction (both optional)

| Field                  | Type   | Description                                                                                                                                                                                                                                                                    |
| ---------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `transformInstruction` | string | What to extract from each page, in plain language. Applied to every page in the crawl. When omitted and no `schema` is set, each page's `data` field contains the raw page markdown — no AI is called and no token credits are charged.                                        |
| `schema`               | object | A JSON Schema object defining the exact output structure. When provided, the AI returns JSON matching this schema for every page, which is useful when you need consistent, queryable results across all pages. Takes precedence over `transformInstruction` for output shape. |

### Scope and depth

| Field               | Type      | Default   | Description                                                                                                                                                                 |
| ------------------- | --------- | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxPages`          | integer   | `5`       | Maximum number of pages to crawl (1–50).                                                                                                                                    |
| `maxDepth`          | integer   | unlimited | Maximum link depth from the base URL. `0` visits only the base URL. `1` visits the base URL and pages directly linked from it. Omit for unlimited depth.                    |
| `includePaths`      | string\[] | —         | URL path patterns to include. Only pages whose paths match at least one pattern are crawled. Accepts glob-style patterns, e.g. `["/blog/*", "/news/*"]`.                    |
| `excludePaths`      | string\[] | —         | URL path patterns to skip. Pages matching any pattern are not visited. e.g. `["/tag/*", "/author/*", "/login"]`.                                                            |
| `allowSubdomains`   | boolean   | `false`   | When `true`, follows links to subdomains of the base domain (e.g. `docs.example.com` when base is `example.com`).                                                           |
| `crawlEntireDomain` | boolean   | `false`   | When `true`, follows any link on the same root domain regardless of path. Combine with `includePaths` or `excludePaths` to keep the scope manageable.                       |
| `ignoreQueryParams` | boolean   | `false`   | When `true`, URLs that differ only by query string are treated as the same page. Prevents duplicate processing on sites that append tracking or session parameters to URLs. |

### Delivery

| Field        | Type   | Description                                                                                                                                                                                  |
| ------------ | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `webhookUrl` | string | A URL that receives POST requests as the job progresses. Spidra fires one event per successful page, plus a final event when the job completes. See [Webhook events](#webhook-events) below. |

### Access and authentication

| Field          | Type    | Default    | Description                                                                                                                                            |
| -------------- | ------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `useProxy`     | boolean | `false`    | Route requests through residential proxies to reduce bot detection and access geo-restricted content.                                                  |
| `proxyCountry` | string  | `"global"` | Two-letter ISO country code (`"us"`, `"de"`), `"eu"` for rotation across EU member states, or `"global"` for no preference. Requires `useProxy: true`. |
| `cookies`      | string  | —          | Session cookies for crawling pages that require a login. Accepts standard format (`name=value; name2=value2`) or a raw Chrome DevTools paste.          |

***

## Proxy and Geo-Targeting

```json theme={null}
{
  "baseUrl": "https://example.com/products",
  "crawlInstruction": "Crawl all product listing pages",
  "transformInstruction": "Extract product name, price, and availability",
  "useProxy": true,
  "proxyCountry": "us"
}
```

Use `"proxyCountry": "global"` (or omit it) for no country preference. Use `"eu"` to rotate across EU member states. For a specific country pass its two-letter ISO code.

<Card title="Stealth Mode and Geo-Targeting" icon="shield-halved" href="/features/stealth-mode">
  Full country list, EU rotation, examples, and credit costs
</Card>

***

## Scoped Crawling with Path Filters

Combine `includePaths` and `excludePaths` to keep crawls focused on the content you actually need.

```json theme={null}
{
  "baseUrl": "https://example.com",
  "crawlInstruction": "Crawl all documentation pages",
  "transformInstruction": "Extract the page title and main content",
  "maxPages": 30,
  "includePaths": ["/docs/*"],
  "excludePaths": ["/docs/changelog/*", "/docs/legacy/*"]
}
```

***

## Structured Output with Schema

Use `schema` when you need every page to return the same fields in the same format — useful for feeding results directly into a database or downstream pipeline.

<Tip>
  Use the [Spidra JSON Schema Generator](https://spidra.io/tools/json-schema-generator) to build and preview your schema visually before pasting it here.
</Tip>

```json theme={null}
{
  "baseUrl": "https://example.com/jobs",
  "crawlInstruction": "Crawl all job listing pages",
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "location": { "type": "string" },
      "salary": { "type": "string" },
      "remote": { "type": "boolean" }
    }
  },
  "maxPages": 20
}
```

***

## Raw Content (No Extraction)

Omit both `transformInstruction` and `schema` to get the raw page content without any AI processing. Each page's `data` field contains the plain markdown of that page. No token credits are charged.

```json theme={null}
{
  "baseUrl": "https://docs.example.com",
  "crawlInstruction": "Crawl all documentation pages",
  "maxPages": 20
}
```

This is useful when you want to feed the content into your own AI pipeline or process it downstream.

***

## Webhook Events

When `webhookUrl` is set, Spidra sends POST requests to that URL as the job runs. All requests have `Content-Type: application/json`.

**`crawl.page`** — Fired for each page that is successfully processed.

```json theme={null}
{
  "event": "crawl.page",
  "jobId": "abc-123",
  "page": {
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "data": { "title": "My Blog Post", "author": "Jane Smith" }
  }
}
```

**`crawl.completed`** — Fired once when the entire job finishes.

```json theme={null}
{
  "event": "crawl.completed",
  "jobId": "abc-123",
  "pagesScraped": 8,
  "creditsUsed": 22
}
```

**`crawl.failed`** — Fired if the job fails entirely (not for individual page failures).

```json theme={null}
{
  "event": "crawl.failed",
  "jobId": "abc-123",
  "error": "No pages successfully scraped"
}
```

Webhook delivery is fire-and-forget. Spidra does not retry on failure and does not block the crawl if your endpoint is slow or unreachable. If you need guaranteed delivery, poll [GET /crawl/{jobId}/pages](/api-reference/crawling/crawl-pages) when the job completes.

***

## Authenticated Crawling

```json theme={null}
{
  "baseUrl": "https://app.example.com/dashboard",
  "crawlInstruction": "Find all report pages",
  "transformInstruction": "Extract report titles and dates",
  "maxPages": 10,
  "cookies": "session_id=abc123; auth_token=xyz789"
}
```

<Card title="Authenticated Scraping" icon="lock-open" href="/features/authenticated-scraping">
  Full guide on getting cookies and formats
</Card>


## OpenAPI

````yaml POST /crawl
openapi: 3.1.0
info:
  title: Spidra API
  version: 1.0.0
  description: >-
    Public API endpoints for web scraping via Spidra. Authenticate with
    `Authorization: Bearer YOUR_API_KEY`.
servers:
  - url: https://api.spidra.io/api
security:
  - BearerAuth: []
  - ApiKeyAuth: []
paths:
  /crawl:
    post:
      tags:
        - Crawling
      summary: Submit a Crawl Job
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required:
                - baseUrl
                - crawlInstruction
              properties:
                baseUrl:
                  type: string
                  format: uri
                  description: >-
                    The starting URL for the crawl. Spidra begins here and
                    follows links outward.
                crawlInstruction:
                  type: string
                  description: >-
                    Plain-language instruction for which pages to follow (e.g.,
                    'all product pages', 'blog posts only'). Spidra uses this to
                    decide which links to visit and which to ignore.
                transformInstruction:
                  type: string
                  description: >-
                    Plain-language instruction for what to extract from each
                    page (e.g., 'Extract title, author, and publish date'). When
                    omitted and no schema is provided, each page's data field
                    contains the raw page markdown — no AI is used and no token
                    credits are charged.
                schema:
                  type: object
                  description: >-
                    A JSON Schema object that defines the exact structure of the
                    extracted data. When provided, the AI returns JSON matching
                    this schema for every page. Use this when you need
                    consistent, queryable output across all pages. Root must be
                    type 'object'. Takes precedence over transformInstruction
                    for output shape.
                maxPages:
                  type: integer
                  default: 5
                  minimum: 1
                  maximum: 50
                  description: Maximum number of pages to crawl.
                maxDepth:
                  type: integer
                  minimum: 0
                  description: >-
                    Maximum link depth from the base URL. 0 means only the base
                    URL itself is visited. 1 means the base URL and pages
                    directly linked from it. Omit for unlimited depth.
                includePaths:
                  type: array
                  items:
                    type: string
                  description: >-
                    URL path patterns to include. Only pages whose paths match
                    at least one pattern will be crawled. Use glob-style
                    patterns (e.g., '/blog/*', '/products/*'). Takes effect
                    after crawlInstruction filtering.
                excludePaths:
                  type: array
                  items:
                    type: string
                  description: >-
                    URL path patterns to exclude. Pages matching any pattern are
                    skipped entirely (e.g., '/admin/*', '/login', '/tag/*').
                allowSubdomains:
                  type: boolean
                  default: false
                  description: >-
                    When true, the crawler follows links to subdomains of the
                    base URL (e.g., docs.example.com when base is example.com).
                crawlEntireDomain:
                  type: boolean
                  default: false
                  description: >-
                    When true, the crawler follows any link on the same root
                    domain regardless of the starting path. Use with
                    includePaths or excludePaths to keep it focused.
                ignoreQueryParams:
                  type: boolean
                  default: false
                  description: >-
                    When true, URLs that differ only by query string are treated
                    as the same page. Prevents duplicate crawling on sites that
                    append tracking or session parameters to URLs.
                webhookUrl:
                  type: string
                  format: uri
                  description: >-
                    A URL that receives a POST request each time a page finishes
                    processing. Useful for streaming results into your own
                    pipeline instead of polling at the end.
                useProxy:
                  type: boolean
                  default: false
                  description: >-
                    Route requests through residential proxies to reduce bot
                    detection and access geo-restricted content.
                proxyCountry:
                  type: string
                  description: >-
                    Two-letter ISO country code (e.g., 'us', 'de'), 'eu' for EU
                    rotation, or 'global' for no preference. Requires useProxy:
                    true.
                cookies:
                  type: string
                  description: >-
                    Session cookies for crawling pages that require
                    authentication. Accepts standard cookie string format
                    (name=value; name2=value2) or a raw Chrome DevTools paste.
            examples:
              blog_crawl:
                summary: Crawl a blog and extract structured data
                value:
                  baseUrl: https://example.com/blog
                  crawlInstruction: Crawl all blog post pages
                  transformInstruction: >-
                    Extract the title, author, publish date, and a one-sentence
                    summary
                  maxPages: 10
                  maxDepth: 2
                  includePaths:
                    - /blog/*
                  excludePaths:
                    - /blog/tag/*
                    - /blog/author/*
              schema_extraction:
                summary: Structured output with JSON Schema
                value:
                  baseUrl: https://jobs.example.com
                  crawlInstruction: Crawl all job listing pages
                  schema:
                    type: object
                    properties:
                      title:
                        type: string
                      location:
                        type: string
                      salary:
                        type:
                          - string
                          - 'null'
                      remote:
                        type: boolean
                  maxPages: 25
              raw_content:
                summary: Collect raw page content without AI extraction
                value:
                  baseUrl: https://docs.example.com
                  crawlInstruction: Crawl all documentation pages
                  maxPages: 50
                  includePaths:
                    - /docs/*
              authenticated:
                summary: Crawl pages behind a login
                value:
                  baseUrl: https://app.example.com/reports
                  crawlInstruction: Crawl all report pages
                  transformInstruction: Extract the report title, date, and key metrics
                  maxPages: 15
                  cookies: session_id=abc123; auth_token=xyz789
              geo_targeted:
                summary: Geo-targeted crawl with proxy
                value:
                  baseUrl: https://example.co.uk/products
                  crawlInstruction: Crawl all product pages
                  transformInstruction: Extract product name, price in GBP, and availability
                  maxPages: 20
                  useProxy: true
                  proxyCountry: gb
      responses:
        '202':
          description: Crawl job queued
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    enum:
                      - queued
                  jobId:
                    type: string
                  message:
                    type: string
              example:
                status: queued
                jobId: 550e8400-e29b-41d4-a716-446655440000
                message: >-
                  Crawl job queued. Poll
                  /api/crawl/550e8400-e29b-41d4-a716-446655440000 for results.
        '400':
          description: Invalid request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
              example:
                status: error
                message: 'Missing required fields: baseUrl, crawlInstruction'
        '403':
          description: Quota exceeded
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    ErrorResponse:
      type: object
      properties:
        status:
          type: string
          enum:
            - error
        message:
          type: string
      required:
        - status
        - message
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: API key
    ApiKeyAuth:
      type: apiKey
      in: header
      name: x-api-key

````