The Crawling Playground lets you crawl multiple pages from a website and extract structured data from each page using natural language instructions. It is designed for experimenting with large-scale crawling, validating extraction prompts, and understanding how the crawler behaves across different page layouts and content types.
Here is how it works:
- Enter your Target URL - This is the starting point for the crawl (e.g., https://example.com/blog)
- Set crawl instructions - These tell Spidra which pages to follow from the target URL. Spidra uses them to decide which links to visit and which to ignore, which helps prevent crawling irrelevant pages like navigation links, login pages, or unrelated sections of the website.
You describe this in plain language. For example:
- “Crawl all blog post pages”
- “Visit product pages only”
- “Follow links in the /docs/ section”
The clearer your instructions, the more focused the crawl will be.
- Set transform instructions - The transform instructions describe what information Spidra should extract from each page it visits.
This is where you define the structure of your data. For example:
- “Extract the title, author, and publish date”
- “Get the product name, price, and description”
- “Pull the main article content”
These instructions are applied consistently to every page in the crawl. If pages share a similar layout, you’ll get clean, structured results across all of them. If layouts differ, this step lets you quickly test and refine your extraction prompt before running a larger crawl (see the example after these steps).
- Set max pages - You can limit how many pages Spidra is allowed to crawl, from 1 to 50. Higher limits may take longer to process.
You will find this useful if you are:
- Running small experiments
- Controlling token usage
- Validating your crawl and transform instructions before expanding
- Run the crawl - Spidra will discover pages, visit each one, and extract the data you requested
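Putting these steps together, here is a minimal sketch of how the playground inputs and the resulting structured data might look. The field names, the record shape, and the `playground_inputs` dict are illustrative assumptions for the sake of example, not Spidra's actual schema.

```python
# Illustrative only: the input fields and the result shape below are
# assumptions for this example, not Spidra's actual schema.

# The playground inputs described above, for a small blog crawl.
playground_inputs = {
    "target_url": "https://example.com/blog",           # starting point for the crawl
    "crawl_instructions": "Crawl all blog post pages",  # which links to follow
    "transform_instructions": "Extract the title, author, and publish date",
    "max_pages": 5,                                      # small limit while validating instructions
}

# The kind of structured record each crawled page might yield when the
# transform instructions above are applied consistently.
example_record = {
    "url": "https://example.com/blog/my-first-post",
    "title": "My First Post",
    "author": "Jane Doe",
    "publish_date": "2024-03-18",
}

print(example_record["title"], "by", example_record["author"])
```

Keeping the transform instructions explicit about the fields you want, as in this sketch, makes it easier to compare results across pages with different layouts.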
Features
Some of the features that come with Crawl include:
- Smart Page Discovery: Spidra uses your crawl instructions to intelligently find relevant pages. It won’t waste time on navigation menus or unrelated links.
- AI-Powered Extraction: Each page is processed with AI to extract exactly the data you described. No CSS selectors or XPath needed.
- Automatic CAPTCHA Solving: If a page presents a CAPTCHA or bot challenge during the crawl, Spidra handles it automatically. This allows the crawl to continue without manual intervention, which is useful when working with multiple pages.
- Proxy Support: Enable stealth mode for sites that block scrapers. When enabled, Spidra routes requests through stealth infrastructure to reduce blocks, rate limits, and detection.
- Download Results: After crawling completes, you can download all extracted data as a ZIP file containing the transformed data (your extracted content), the raw markdown of each page, and the original HTML snapshots (see the sketch after this list).
- Retry Failed Pages: If AI extraction fails on specific pages, you can retry them individually without re-crawling the entire site.
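For working with the downloaded archive, a short sketch like the one below can help. It assumes the ZIP was saved as `crawl-results.zip` and that the transformed data is stored as JSON files; the actual file names and folder layout inside the archive may differ, so adjust the paths to match your download.

```python
# A minimal sketch for unpacking the downloaded results ZIP. The archive name
# and the file layout inside it are assumptions; adjust them to match what
# your download actually contains.
import json
import zipfile
from pathlib import Path

archive = Path("crawl-results.zip")   # hypothetical name of the downloaded ZIP
output_dir = Path("crawl-results")

with zipfile.ZipFile(archive) as zf:
    zf.extractall(output_dir)

# Transformed data: load every JSON file found in the extracted archive.
records = [json.loads(p.read_text()) for p in output_dir.rglob("*.json")]

# Raw markdown pages and original HTML snapshots, if you want to inspect them.
markdown_pages = list(output_dir.rglob("*.md"))
html_snapshots = list(output_dir.rglob("*.html"))

print(f"{len(records)} extracted records, "
      f"{len(markdown_pages)} markdown pages, "
      f"{len(html_snapshots)} HTML snapshots")
```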
Tips for Best Results
- Be specific with crawl instructions: “All blog posts from 2024” works much better than “crawl everything.” Clear constraints lead to cleaner results.
- Start small: Test with 3–5 pages first to validate your crawl and transform instructions before scaling up.
- Choose the right target URL: Make sure your target URL actually links to the pages you want to crawl. Index, category, or listing pages are usually ideal.