The Ruby SDK wraps the Spidra API so you’re not manually firing HTTP requests and writing polling loops from scratch. It handles job submission, status polling, and error mapping, and it pulls in zero external dependencies. Everything runs on the standard library.

Installation

gem install spidra
Or add it to your Gemfile:
gem "spidra"
Requires Ruby 2.7 or higher.
Get your API key from app.spidra.io under Settings → API Keys. Store it as an environment variable. Never hardcode it.

Getting started

require "spidra"

client = Spidra.new(ENV["SPIDRA_API_KEY"])
From here you access everything through client.scrape, client.batch, client.crawl, client.logs, and client.usage. If you need to point at a different host or change the HTTP timeout, pass those as keyword arguments:
client = Spidra.new(
  ENV["SPIDRA_API_KEY"],
  base_url: "http://localhost:4321/api",
  timeout:  60
)

Scraping

The scraper accepts up to three URLs per request and processes them in parallel. You can pass a plain extraction prompt, a full JSON schema, per-URL browser actions, or any mix of those. The simplest path is run — it submits the job and blocks until it finishes, then returns the result:
job = client.scrape.run(
  urls:   [{ url: "https://example.com/pricing" }],
  prompt: "Extract all pricing plans with name, price, and included features",
  output: "json"
)

puts job["result"]["content"]
# {"plans" => [{"name" => "Starter", "price" => "$9/mo", ...}]}
If you’d rather fire and move on, submit returns a job ID immediately and you call get whenever you’re ready to check:
response = client.scrape.submit(
  urls:   [{ url: "https://example.com" }],
  prompt: "Extract the main headline"
)
job_id = response["jobId"]

# Later...
status = client.scrape.get(job_id)

if status["status"] == "completed"
  puts status["result"]["content"]
end
Job statuses move through: waiting → active → completed (or failed).
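If you prefer to drive the polling yourself, the submit/get split above reduces to a small loop. A minimal sketch, written against any object that responds to get; the FakeScrape stub only stands in for client.scrape so the sketch runs on its own, and the interval and timeout defaults here are arbitrary:

```ruby
# Poll `fetcher.get(job_id)` until the job reaches a terminal status.
# Raises if the timeout elapses first. Works with anything exposing
# `get`, e.g. client.scrape, client.batch, or client.crawl.
def poll_job(fetcher, job_id, interval: 3, timeout: 120)
  deadline = Time.now + timeout
  loop do
    job = fetcher.get(job_id)
    return job if %w[completed failed].include?(job["status"])
    raise "timed out waiting for job #{job_id}" if Time.now > deadline
    sleep interval
  end
end

# Tiny stub standing in for client.scrape, to show the shape:
class FakeScrape
  def initialize
    @calls = 0
  end

  def get(_job_id)
    @calls += 1
    { "status" => @calls < 3 ? "active" : "completed" }
  end
end

job = poll_job(FakeScrape.new, "abc", interval: 0)
puts job["status"]  # => completed
```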

Scrape parameters

Parameter           Type     Description
urls                Array    Up to 3 URLs. Each entry is { url: "..." } with an optional actions array
prompt              String   What to extract, written in plain English
output              String   "markdown" (default) or "json"
schema              Hash     JSON Schema that forces a specific output shape
useProxy            Boolean  Route through a residential proxy
proxyCountry        String   Two-letter country code: "us", "de", "jp", etc.
extractContentOnly  Boolean  Strip nav, ads, and boilerplate before the AI sees the page
screenshot          Boolean  Capture a viewport screenshot
fullPageScreenshot  Boolean  Capture a full-page scrolled screenshot
cookies             String   Raw Cookie header string for pages behind a login
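These parameters combine freely in one request. A sketch of a geo-pinned scrape that also strips boilerplate and captures a screenshot (the call itself is commented out; the hash simply mirrors the table above):

```ruby
params = {
  urls:               [{ url: "https://example.com/pricing" }],
  prompt:             "Extract all pricing plans with name and price",
  output:             "json",
  useProxy:           true,
  proxyCountry:       "de",
  extractContentOnly: true,
  screenshot:         true
}

# job = client.scrape.run(params)
```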

Enforcing an exact output shape

Without a schema the AI extracts what it finds. With a schema, missing fields come back as null rather than guessed values, which matters when the output feeds a database or a typed pipeline downstream:
job = client.scrape.run(
  urls:   [{ url: "https://jobs.example.com/senior-engineer" }],
  prompt: "Extract the job listing details",
  output: "json",
  schema: {
    type:     "object",
    required: ["title", "company", "remote"],
    properties: {
      title:      { type: "string" },
      company:    { type: "string" },
      remote:     { type: ["boolean", "null"] },
      salary_min: { type: ["number", "null"] },
      skills:     { type: "array", items: { type: "string" } }
    }
  }
)
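When that schema-constrained result feeds code downstream, branch on nil instead of assuming every field is populated. A sketch over a made-up result hash (the values are illustrative, not real output):

```ruby
# Sample of what job["result"]["content"] might look like after a
# schema-constrained extraction; salary_min was absent on the page.
listing = {
  "title"      => "Senior Engineer",
  "company"    => "Example Corp",
  "remote"     => true,
  "salary_min" => nil,
  "skills"     => ["Ruby", "SQL"]
}

salary = listing["salary_min"] || "not listed"
puts "#{listing["title"]} at #{listing["company"]} (salary: #{salary})"
# => Senior Engineer at Example Corp (salary: not listed)
```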

Scraping geo-restricted content

Some sites serve different prices or content depending on where you’re browsing from. Set useProxy and proxyCountry to route through a residential IP in that country:
job = client.scrape.run(
  urls:         [{ url: "https://www.amazon.de/gp/bestsellers" }],
  prompt:       "List the top 10 products with name and price",
  useProxy:     true,
  proxyCountry: "de"
)
Supported country codes include us, gb, de, fr, jp, au, ca, br, in, nl, and 40+ more. Use "global" or "eu" for regional routing without pinning to a specific country.

Scraping pages behind a login

If the page requires a session, pass your cookies as a raw header string. The easiest way to get this is to log in through your browser, open devtools, and copy the Cookie header from any authenticated request:
job = client.scrape.run(
  urls:    [{ url: "https://app.example.com/dashboard" }],
  prompt:  "Extract the monthly revenue and active user count",
  cookies: "session=abc123; auth_token=xyz789"
)
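If the session lives in a hash rather than a header string (say, exported from another tool), joining it into the raw Cookie format is one line. A sketch with placeholder cookie names:

```ruby
session_cookies = { "session" => "abc123", "auth_token" => "xyz789" }

# Ruby hashes preserve insertion order, so the header comes out stable.
cookie_header = session_cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
puts cookie_header  # => session=abc123; auth_token=xyz789

# job = client.scrape.run(urls: [...], prompt: "...", cookies: cookie_header)
```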

Browser actions

Sometimes you need to interact with the page before extraction — dismiss a cookie banner, type into a search box, scroll to load lazy content. Pass an actions array inside the URL entry and they run in order before the AI sees the page:
job = client.scrape.run(
  urls: [
    {
      url:     "https://example.com/products",
      actions: [
        { type: "click",  selector: "#accept-cookies" },
        { type: "wait",   duration: 1000 },
        { type: "scroll", to: "80%" }
      ]
    }
  ],
  prompt: "Extract all product names and prices visible on the page"
)
For selector you can pass a CSS selector or XPath. If you’d rather describe the element in plain English, use value and Spidra will locate it with AI.
Action   What it does
click    Click any element — use selector for CSS, value for plain text
type     Type into an input or textarea
check    Check a checkbox
uncheck  Uncheck a checkbox
wait     Pause for duration milliseconds
scroll   Scroll to a percentage of the page height, e.g. "80%"
forEach  Loop over every matched element and extract from each one
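As an example of the plain-English value locator described above, the actions here click a button by description instead of by selector before scraping (the button text and wait duration are hypothetical):

```ruby
urls = [
  {
    url: "https://example.com/products",
    actions: [
      # No selector needed: Spidra locates the element from the description.
      { type: "click", value: "the 'Load more products' button" },
      { type: "wait",  duration: 2000 }
    ]
  }
]

# job = client.scrape.run(urls: urls, prompt: "Extract every product name and price")
```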

Controlling how long run waits

By default run polls every 3 seconds and gives up after 120 seconds. You can override both by passing keyword arguments after the params hash:
job = client.scrape.run(
  { urls: [{ url: "https://example.com" }], prompt: "..." },
  poll_interval: 5,  # seconds between checks
  timeout: 60        # give up after this many seconds
)
The same options work on batch.run and crawl.run.

Batch scraping

When you have a list of URLs to process, batch is the right tool. You can submit up to 50 URLs in a single request and they all run in parallel. Unlike the scraper, each URL here is a plain string — there’s no per-URL actions support in batch mode.
batch = client.batch.run(
  urls: [
    "https://shop.example.com/product/1",
    "https://shop.example.com/product/2",
    "https://shop.example.com/product/3"
  ],
  prompt: "Extract product name, price, and whether it is in stock",
  output: "json"
)

puts "#{batch["completedCount"]}/#{batch["totalUrls"]} completed"

batch["items"].each do |item|
  if item["status"] == "completed"
    puts item["result"].inspect
  else
    puts "Failed: #{item["url"]} (#{item["error"]})"
  end
end
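When all you need out of the items array is the failures, group_by keeps the reporting compact. A sketch over sample items in the same shape (the URLs and error text are made up):

```ruby
# Sample items in the shape batch["items"] takes:
items = [
  { "url" => "https://shop.example.com/product/1", "status" => "completed" },
  { "url" => "https://shop.example.com/product/2", "status" => "failed", "error" => "timeout" },
  { "url" => "https://shop.example.com/product/3", "status" => "completed" }
]

by_status   = items.group_by { |item| item["status"] }
failed_urls = (by_status["failed"] || []).map { |item| item["url"] }
puts "retry candidates: #{failed_urls.join(", ")}"
```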
Each item moves through pending → running → completed (or failed). The batch itself follows the same lifecycle, plus a cancelled state if you stop it early. If you don’t want to wait for the whole thing to finish, use submit and get separately:
response = client.batch.submit(
  urls:   ["https://example.com/1", "https://example.com/2"],
  prompt: "Extract the page title and meta description"
)
batch_id = response["batchId"]

# Come back later
result = client.batch.get(batch_id)
puts "#{result["completedCount"]} of #{result["totalUrls"]} done"

Retrying failures and cancelling

If some items fail due to timeouts or transient errors, you can retry just those without re-running the ones that already succeeded:
if batch["failedCount"] > 0
  client.batch.retry(batch_id)
end
To stop a running batch and get credits back for anything that hasn’t started yet:
client.batch.cancel(batch_id)
To look through past batches:
page = client.batch.list(1, 20) # page, limit

page["jobs"].each do |job|
  puts "#{job["uuid"]} #{job["status"]} (#{job["completedCount"]}/#{job["totalUrls"]})"
end
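To walk every page rather than just one, keep requesting until a page comes back short. A sketch against any object with the same list(page, limit) shape; the FakeBatchList stub stands in for client.batch so the sketch runs on its own:

```ruby
# Yield every job across all pages of a paginated list endpoint
# that takes (page, limit) and returns { "jobs" => [...] }.
def each_batch_job(lister, limit: 20)
  page = 1
  loop do
    jobs = lister.list(page, limit)["jobs"]
    jobs.each { |job| yield job }
    break if jobs.size < limit   # a short page means we've reached the end
    page += 1
  end
end

# Stub standing in for client.batch: two pages, the second one short.
class FakeBatchList
  PAGES = [
    { "jobs" => [{ "uuid" => "a" }, { "uuid" => "b" }] },
    { "jobs" => [{ "uuid" => "c" }] }
  ].freeze

  def list(page, _limit)
    PAGES[page - 1] || { "jobs" => [] }
  end
end

seen = []
each_batch_job(FakeBatchList.new, limit: 2) { |job| seen << job["uuid"] }
puts seen.inspect  # => ["a", "b", "c"]
```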

Crawling

Crawling is different from scraping. You give it a starting URL and it discovers and processes pages on its own, following links according to your instructions. Good for indexing a docs site, monitoring a competitor’s blog, or building a structured dataset from an entire section of a site.
job = client.crawl.run(
  {
    baseUrl:              "https://competitor.com/blog",
    crawlInstruction:     "Follow links to blog posts only, skip tag pages, category pages, and the homepage",
    transformInstruction: "Extract the post title, author name, publish date, and a one-sentence summary",
    maxPages:             30,
    useProxy:             true
  },
  timeout: 360
)

job["result"].each do |page|
  puts "#{page["url"]}: #{page["data"].inspect}"
end
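A crawl result like the one above drops naturally into JSON Lines for later processing. A sketch using sample pages in the same url/data shape (the titles and output filename are made up):

```ruby
require "json"

# Sample pages in the shape job["result"] takes after a crawl:
pages = [
  { "url"  => "https://competitor.com/blog/post-1",
    "data" => { "title" => "Post 1", "author" => "Ada" } },
  { "url"  => "https://competitor.com/blog/post-2",
    "data" => { "title" => "Post 2", "author" => "Grace" } }
]

# One JSON object per line: easy to append to and to stream later.
jsonl = pages.map { |page| JSON.generate(page) }.join("\n")
File.write("crawl_results.jsonl", jsonl + "\n")
```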
crawlInstruction tells the crawler which links to follow. transformInstruction tells the AI what to extract from each page it visits. maxPages is a safety cap so the crawl doesn’t run indefinitely. The default timeout for crawl.run is 300 seconds — pass a higher value for bigger crawls. The same useProxy, proxyCountry, and cookies options from the scraper work here too. Just like scraping, you can fire-and-forget with submit and poll with get:
response = client.crawl.submit(
  baseUrl:              "https://example.com/docs",
  crawlInstruction:     "Follow all documentation pages",
  transformInstruction: "Extract the page title and a short summary of the content",
  maxPages:             50
)
job_id = response["jobId"]

status = client.crawl.get(job_id)
# status moves through: waiting → active → running → completed (or failed)

Downloading the raw content

Once a crawl completes, you can fetch signed URLs to download the raw HTML and Markdown for every page that was crawled. These links expire after an hour:
result = client.crawl.pages(job_id)

result["pages"].each do |page|
  # page["html_url"]     — download the raw HTML
  # page["markdown_url"] — download the cleaned Markdown
  puts "#{page["url"]}: #{page["status"]}"
end
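Since the links expire, it's worth downloading right away. A sketch of one way to turn each page URL into a filesystem-safe local name before fetching (the naming scheme is ours, not the SDK's):

```ruby
require "uri"

# Turn a crawled page URL into a filesystem-safe markdown filename.
def local_name_for(page_url)
  uri  = URI.parse(page_url)
  path = uri.path.empty? || uri.path == "/" ? "index" : uri.path
  slug = path.gsub(/[^a-zA-Z0-9]+/, "-").gsub(/\A-|-\z/, "")
  "#{uri.host}-#{slug}.md"
end

puts local_name_for("https://example.com/docs/getting-started")
# => example.com-docs-getting-started.md

# Then fetch each signed URL before it expires, e.g.:
# File.write(local_name_for(page["url"]), Net::HTTP.get(URI(page["markdown_url"])))
```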

Re-extracting with a different prompt

If you crawled a site and want to pull out different information, you don’t have to re-crawl. extract runs a new AI pass over the already-crawled content and charges only transformation credits:
result = client.crawl.extract(
  completed_job_id,
  "Extract only product SKUs and prices as structured JSON"
)

# This creates a new job — poll it like any other
extracted = client.crawl.get(result["jobId"])

Browsing your crawl history

history = client.crawl.history(1, 10)
puts "Total crawl jobs: #{history["total"]}"

stats = client.crawl.stats
puts "All-time: #{stats["total"]}"

Logs

Every scrape request your API key makes gets logged automatically. You can filter by status, URL, date range, or where it came from:
result = client.logs.list(
  status:     "failed",
  searchTerm: "amazon.com",
  dateStart:  "2024-01-01",
  dateEnd:    "2024-12-31",
  page:       1,
  limit:      20
)

result.dig("data", "logs").each do |log|
  puts "#{log["urls"][0]["url"]}: #{log["status"]} (#{log["credits_used"]} credits)"
end
To fetch the full details of a single log entry including the AI extraction output:
log = client.logs.get(log_uuid)
puts log.inspect

Usage statistics

Check how many requests and credits your account has used over a given period:
result = client.usage.get("30d") # "7d" | "30d" | "weekly"

result["data"].each do |row|
  puts "#{row["date"]}: #{row["requests"]} requests, #{row["credits"]} credits"
end
"7d" gives one row per day for the last week. "30d" gives the last month. "weekly" gives one row per week for the last seven weeks.
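Each row is a plain hash, so period totals are a one-line reduce. A sketch over sample rows in the same shape (the numbers are made up):

```ruby
# Sample rows in the shape result["data"] takes for a "7d" query:
rows = [
  { "date" => "2024-06-01", "requests" => 12, "credits" => 30 },
  { "date" => "2024-06-02", "requests" => 8,  "credits" => 19 }
]

total_credits = rows.sum { |row| row["credits"] }
puts "#{total_credits} credits over #{rows.size} days"  # => 49 credits over 2 days
```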

Error handling

Every API error is mapped to a typed exception class, so you can rescue exactly what you care about and let the rest bubble up:
begin
  job = client.scrape.run(
    urls:   [{ url: "https://example.com" }],
    prompt: "Extract the main headline"
  )
rescue Spidra::AuthenticationError
  # Bad or missing API key
rescue Spidra::InsufficientCreditsError
  # Account is out of credits — time to top up
rescue Spidra::RateLimitError
  # Slow down — you're hitting limits
rescue Spidra::ServerError => e
  # Something went wrong on Spidra's side — retry is usually safe
  puts "Server error (#{e.status}): #{e.message}"
rescue Spidra::Error => e
  # Catch-all for anything else
  puts "API error #{e.status}: #{e.message}"
end
Exception                         Status  When
Spidra::AuthenticationError       401     The API key is missing or invalid
Spidra::InsufficientCreditsError  403     No credits remaining on the account
Spidra::RateLimitError            429     Too many requests — back off
Spidra::ServerError               500     Unexpected server-side error
Spidra::Error                     any     Base class for all Spidra exceptions
All exceptions expose .status for the HTTP status code and .message for a human-readable explanation.
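Since rate-limit and server errors are usually worth retrying, a backoff wrapper pairs well with these classes. A generic sketch that takes the retryable classes as an argument; it is demonstrated with RuntimeError so it runs without the gem, and with the SDK you would pass [Spidra::RateLimitError, Spidra::ServerError] instead:

```ruby
# Retry the block with exponential backoff when one of the given
# exception classes is raised; re-raise once attempts are exhausted.
def with_retries(retryable, attempts: 3, base_delay: 1)
  tries = 0
  begin
    yield
  rescue *retryable
    tries += 1
    raise if tries >= attempts
    sleep base_delay * (2**(tries - 1))  # 1s, 2s, 4s, ...
    retry
  end
end

calls = 0
result = with_retries([RuntimeError], attempts: 3, base_delay: 0) do
  calls += 1
  raise "flaky" if calls < 3  # fail twice, then succeed
  "ok"
end
puts result  # => ok
```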