Here is how it works:
- Enter your Target URL - This is the starting point for the crawl (e.g., https://example.com/blog)

- Set crawl instructions - This tells Spidra which pages to follow from the target URL. Spidra uses these instructions to decide which links to follow and which to ignore, which helps prevent crawling irrelevant pages like navigation links, login pages, or unrelated sections of the website. For example:
  - “Crawl all blog post pages”
  - “Visit product pages only”
  - “Follow links in the /docs/ section”

- Set transform instructions - The transform instructions describe what information Spidra should extract from each page it visits. For example:
  - “Extract the title, author, and publish date”
  - “Get the product name, price, and description”
  - “Pull the main article content”

- Set max pages - You can limit how many pages Spidra is allowed to crawl (1-50); higher numbers may take longer to process. A lower limit is useful for:
  - Running small experiments
  - Controlling token usage
  - Validating your crawl and transform instructions before expanding

- Run the crawl - Spidra will discover pages, visit each one, and extract the data you requested
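
Put together, a programmatic run of the steps above might look like the sketch below. The endpoint URL and field names here are assumptions for illustration, not a documented Spidra API; in practice you would set the same values through the Crawl form.

```python
import requests  # pip install requests

# Hypothetical endpoint and field names, shown only to mirror the steps
# above -- check Spidra's own API reference for the real request shape.
payload = {
    "target_url": "https://example.com/blog",           # starting point for the crawl
    "crawl_instructions": "Crawl all blog post pages",  # which links to follow
    "transform_instructions": "Extract the title, author, and publish date",
    "max_pages": 5,  # 1-50; start small to validate instructions
}

response = requests.post(
    "https://api.spidra.example/v1/crawl",  # illustrative URL, not a real endpoint
    json=payload,
    timeout=600,  # crawls across many pages can take a while
)
response.raise_for_status()
print(response.json())  # e.g., a job ID or the extracted records
```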
Features
Some of the features that come with Crawl include:
- Smart Page Discovery: Spidra uses your crawl instructions to intelligently find relevant pages. It won’t waste time on navigation menus or unrelated links.
- AI-Powered Extraction: Each page is processed with AI to extract exactly the data you described. No CSS selectors or XPath needed.
- Automatic CAPTCHA Solving: If a page presents a CAPTCHA or bot challenge during the crawl, Spidra handles it automatically. This allows the crawl to continue without manual intervention, which is useful when working with multiple pages.
- Proxy Support: Enable stealth mode for sites that block scrapers. When enabled, Spidra routes requests through stealth infrastructure to reduce blocks, rate limits, and detection.
- Download Results: After crawling completes, you can download all extracted data as a ZIP file containing the transformed data (your extracted content), the raw markdown of each page, and the original HTML snapshots (see the sketch after this list).
- Retry Failed Pages: If AI extraction fails on specific pages, you can retry them individually without re-crawling the entire site.
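
Because the download is a plain ZIP archive, post-processing needs no Spidra-specific tooling. Here is a minimal sketch of unpacking it and walking the transformed data, assuming the extracted content is stored as JSON files; the archive name and internal layout are assumptions, so inspect your own download first.

```python
import json
import zipfile
from pathlib import Path

# Unpack the downloaded archive. The name "crawl_results.zip" and the
# internal layout are assumptions for illustration.
with zipfile.ZipFile("crawl_results.zip") as archive:
    archive.extractall("crawl_results")

# Walk whatever JSON the transform step produced; the raw markdown and
# HTML snapshots sit alongside it in the same tree.
for path in sorted(Path("crawl_results").rglob("*.json")):
    record = json.loads(path.read_text(encoding="utf-8"))
    print(f"{path.name}: {record}")
```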
Tips for Best Results
- Be specific with crawl instructions: “All blog posts from 2024” works much better than “crawl everything.” Clear constraints lead to cleaner results.
- Start small: Test with 3–5 pages first to validate your crawl and transform instructions before scaling up.
- Choose the right target URL: Make sure your target URL actually links to the pages you want to crawl. Index, category, or listing pages are usually ideal.