- Published on
Web Scraping & Crawling Tools in 2026 — Scrapy / Playwright / Puppeteer / Crawlee (Apify) / Firecrawl / Jina Reader / Stagehand AI Deep Dive
- Authors

- Name
- Youngju Kim
- @fjvbn20031
Prologue — In 2026, Scraping Became an LLM Training-Data Problem
Web scraping used to be a gray-area task that marketing teams ran to monitor competitor prices. In 2026 it is different. Every LLM training run needs web data, and scraping has become the primary input layer of AI infrastructure. One Common Crawl dump is no longer enough.
This article surveys the web scraping tools, services and strategies that are alive in May 2026. We start with Python, move into headless browsers, pass through proxy clouds and API services, and end with LLM-friendly APIs and AI agent browsers — so you can decide which tool fits which job.
1. The 2026 Web Scraping Map — Four Layers
Scraping tools split into four layers.
[Layer 1: libraries / frameworks — you write the code]
Scrapy / Playwright / Puppeteer / Selenium / Crawlee
Cheerio / Beautiful Soup 4 / lxml / jsoup / Goutte
[Layer 2: cloud / managed — SaaS]
Apify / Bright Data / Oxylabs / Smartproxy / SOAX
[Layer 3: API services — one HTTP call]
ScrapingBee / Browserless / ZenRows / ScrapingAnt / ScraperAPI
[Layer 4: AI / LLM-friendly — Markdown output, agents]
Firecrawl / Jina AI Reader / Diffbot
Stagehand / Browser Use / AnchorBrowser
The decision rule is simple.
- Learning / small projects → Scrapy / Playwright / Cheerio directly.
- Tens of sites, anti-bot bypass → Crawlee plus proxies.
- Hundreds of sites, low ops burden → Apify / ScrapingBee.
- LLM training / RAG indexing → Firecrawl / Jina Reader.
- AI agent driving the browser → Stagehand / Browser Use.
We will walk through each tool in turn.
2. Scrapy — The Python Classic (Alive Since 2008)
Scrapy was first released in 2008 as a Python scraping framework. As of May 2026 it is at version 2.13 and remains the most-downloaded scraping library by far — over 14 million monthly downloads on PyPI.
The core architecture is an async event loop (Twisted, with asyncio migration in progress) plus spiders plus pipelines. A spider puts URLs into a queue, a parse function is called on each response, and extracted items flow through pipelines into storage.
import scrapy
class HackerNewsSpider(scrapy.Spider):
name = "hn"
start_urls = ["https://news.ycombinator.com/"]
def parse(self, response):
for item in response.css("tr.athing"):
yield {
"title": item.css("span.titleline > a::text").get(),
"url": item.css("span.titleline > a::attr(href)").get(),
"rank": item.css("span.rank::text").get(),
}
next_page = response.css("a.morelink::attr(href)").get()
if next_page:
yield response.follow(next_page, self.parse)
scrapy crawl hn -o items.jsonl handles pagination, item extraction and JSONL output in one line.
Scrapy's strengths are its tunable queues (Bloom-filter dupefilter, priority queue), automatic retry / backoff, AutoThrottle, middleware chain, Items + Item Loaders + ItemAdapter, and the scrapy-playwright / scrapy-splash integrations. Its weakness is that JavaScript rendering is not built in. That is why scrapy-playwright integration has become the de facto standard.
Between 2024 and 2025 Scrapy underwent a major refactor that added native asyncio support (twisted.internet.asyncioreactor), and it is compatible with Python 3.13's no-GIL build. It will remain the default choice for data-engineering pipelines in 2026.
3. Playwright (Microsoft) — The Headless Browser Standard
Playwright is Microsoft's headless-browser automation tool, first released in 2020. As of May 2026 it is at version 1.50 and has become the de facto standard for headless browsers. It has more than 71,000 GitHub stars and has long since overtaken Puppeteer.
The key differentiator is that Playwright controls Chrome, Firefox and WebKit through a single API. Puppeteer is Chromium-centric; Playwright was designed multi-browser from day one.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
user_agent="Mozilla/5.0 ...",
viewport={"width": 1920, "height": 1080},
locale="ko-KR",
)
page = context.new_page()
page.goto("https://example.com/list")
page.wait_for_selector("article.post")
items = page.eval_on_selector_all(
"article.post",
"elements => elements.map(e => ({title: e.querySelector('h2').innerText, link: e.querySelector('a').href}))"
)
print(items)
browser.close()
There are Python, Node, Java and .NET bindings. The Python binding sees the most use in data teams.
Notable Playwright features as of 2026:
- Auto-wait —
page.click("button")automatically waits until the button is clickable. No moretime.sleephell. - Trace Viewer — replay every click, network call, and DOM state on a timeline. One of the best browser-automation debuggers around.
- MCP integration — the 2025 Playwright MCP server lets AI agents (Claude, Cursor, others) drive the browser directly.
- Codegen — record your interactions and emit code, useful for bootstrapping a scraper.
- Component testing (a separate library) — test React or Vue components in a real browser.
For scraping, the standard combination is playwright-extra plus puppeteer-extra-plugin-stealth (Node), or playwright-stealth (Python).
4. Puppeteer (Google) — The Original Chromium Automation
Puppeteer is the Node.js library Google's Chrome team released in 2017. It was once synonymous with headless browsing, but has ceded ground to Playwright in the 2020s. As of May 2026 it is at v23 and is still actively maintained.
import puppeteer from 'puppeteer'
const browser = await puppeteer.launch({ headless: 'new' })
const page = await browser.newPage()
await page.goto('https://example.com/list')
await page.waitForSelector('article.post')
const items = await page.$$eval('article.post', (nodes) =>
nodes.map((n) => ({
title: n.querySelector('h2')?.innerText ?? '',
link: (n.querySelector('a') as HTMLAnchorElement)?.href ?? '',
}))
)
console.log(items)
await browser.close()
Puppeteer's positioning relative to Playwright is clear.
- Strengths: tightest binding to the Chrome DevTools Protocol. Integrations with Chrome-only features such as Lighthouse, web-vitals and PDF generation are cleanest here.
- Weaknesses: weak multi-browser support. The API feels older than Playwright's.
- Ecosystem asset:
puppeteer-extrapluspuppeteer-extra-plugin-stealth— the standard stealth plugin, even ported into the Playwright world.
For a new project Playwright is the recommendation, but if your goal is Chrome-specific PDF generation or rendering verification, Puppeteer is still a clean choice.
5. Selenium — Still Alive (Legacy plus WebDriver BiDi)
Selenium is the oldest browser-automation project, first released in 2004. In 2026 Selenium 4.x is the stable line, and 5.0 is in development, making W3C WebDriver BiDi a first-class citizen.
Selenium is often called slow and heavy compared with the modern tools (Playwright, Puppeteer), but it still leads in several areas.
- QA / test automation — the most widely used test tool in the enterprise.
- Grid plus Selenoid / Moon — the most mature container-based distributed-execution infrastructure.
- Legacy codebases — Selenium suites written in the 2010s are still running.
- The center of WebDriver BiDi (bi-directional protocol) standards work — in 2026, Chrome, Firefox and Edge all implement BiDi, so the bi-directional event streaming that was once Playwright's advantage is also available in Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://example.com/list")
WebDriverWait(driver, 10).until(
EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article.post"))
)
posts = driver.find_elements(By.CSS_SELECTOR, "article.post")
for p in posts:
title = p.find_element(By.CSS_SELECTOR, "h2").text
link = p.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
print(title, link)
driver.quit()
For scraping, the key asset is undetected-chromedriver — a fork that smooths out the Chrome WebDriver fingerprint to bypass bot detection. We cover it in Section 14.
6. Crawlee (Apify) — The Modern Scraping Framework
Crawlee is the open-source framework Apify released in 2022. It positions itself as a modern Node / Python replacement for Scrapy. The stable Crawlee for Python release in 2024 made it usable from Python.
The design philosophy is clear — "everything a scraper needs is in the box."
- Auto queue and retry —
RequestQueuepersists URLs. Crawls resume after a crash. - Auto proxy rotation — register a URL pool with
ProxyConfigurationand you are done. - Auto session management —
SessionPoolmanages cookie, UA and proxy combinations. - Auto concurrency —
AutoscaledPooldecides concurrency based on CPU, memory and response time. - Multiple crawlers —
HttpCrawler(fast, no JS),CheerioCrawler,PlaywrightCrawler, andPuppeteerCrawlershare the same API.
import { PlaywrightCrawler, Dataset } from 'crawlee'
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 1000,
async requestHandler({ page, request, enqueueLinks }) {
await page.waitForSelector('article.post')
const items = await page.$$eval('article.post', (nodes) =>
nodes.map((n) => ({
title: n.querySelector('h2')?.textContent ?? '',
link: (n.querySelector('a') as HTMLAnchorElement)?.href ?? '',
}))
)
await Dataset.pushData(items)
await enqueueLinks({ selector: 'a.next-page' })
},
})
await crawler.run(['https://example.com/list'])
That one snippet gives you pagination, automatic retry, a session pool, and auto-scaled concurrency. If you want Scrapy in Node, Crawlee is the answer.
7. Apify — Managed Scraping Cloud
Apify is the managed scraping platform run by the company behind Crawlee. It runs container-shaped jobs called Actors in the cloud, with managed input definitions, storage, proxies and a scheduler.
The big draw of Apify is the Actor Marketplace. As of May 2026 there are more than 5,000 prebuilt Actors — Google search-results scraping, Instagram profiles, Amazon products, LinkedIn companies, TikTok videos, and many more. Marketplace Actors are usually priced per result, in the 5 per 1,000-result range.
import { Actor } from 'apify'
import { PlaywrightCrawler } from 'crawlee'
await Actor.init()
const input = await Actor.getInput()
const crawler = new PlaywrightCrawler({
proxyConfiguration: await Actor.createProxyConfiguration(),
async requestHandler({ page, request }) {
const data = await page.evaluate(() => ({
title: document.querySelector('h1')?.textContent,
content: document.querySelector('article')?.innerText,
}))
await Actor.pushData(data)
},
})
await crawler.run(input.startUrls)
await Actor.exit()
Push this Actor with apify push and it ships to the platform; you trigger it from the web UI with input parameters, and results land in an Apify Dataset. Without a CLI you can also schedule it to run daily.
Pricing is measured in compute units (1 CU is roughly 1 GB-hour of memory), starting on a free $5-per-month credit tier. Proxies are billed separately. Apify Proxy has distinct list prices for datacenter, residential, SERP and Google-SERP categories.
8. Bright Data / Oxylabs / Smartproxy — Proxies plus Scraping Clouds
The biggest weapon in bot detection is IP-based blocking. Once one IP makes more than about ten requests per second, suspicion starts. That is why proxy companies exist.
Bright Data (formerly Luminati) has the largest residential proxy network in the industry. Public figures cite 150 million+ residential IPs, 72 million+ ISP proxies, 3 million+ datacenter proxies, and 700,000+ mobile proxies. By 2026 it offers a full-stack lineup: Web Unlocker (automatic captcha and anti-bot bypass API), SERP API (Google, Bing search results), Web Scraper IDE (write code in the browser), and Scraping Browser (a remote headless browser).
Pricing is per GB. Residential proxies are 15 per GB, datacenter 1 per GB.
Oxylabs is a Lithuania-based proxy company. It has a similar lineup (residential, datacenter, ISP, mobile) plus managed APIs like Real-Time Crawler, E-Commerce Scraper API and Web Unblocker. Pricing is similar to or slightly lower than Bright Data.
Smartproxy (recently rebranded as Decodo) targets a more affordable tier. Residential proxies are around 8 per GB, and the UI is the most beginner-friendly. It offers more freedom than the Apify Proxy model.
For scraping, proxy selection follows these rules:
- Datacenter proxies — cheap, higher block risk. Good for sites with weak bot detection.
- Residential proxies — expensive, almost no blocks. Good for big e-commerce and social.
- Mobile proxies — most expensive, but spoof mobile-app traffic. Effective on TikTok, Instagram.
- ISP proxies — static residential IPs. Use when session persistence matters.
9. ScrapingBee / Browserless / ZenRows — API Scraping Services
These services bundle proxy, headless browser and captcha bypass into "finish the job in one HTTP call".
ScrapingBee is the best-known of the bunch, built by a French company. Throw a URL at it, get the rendered HTML back.
curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR_KEY&url=https://example.com&render_js=true&premium_proxy=true"
Pricing is credit-based. No JS render costs 1 credit, JS render 5, premium proxy 25, captcha bypass around 100. A 1,000-credit pack is about $9. For small projects this is the cheapest option.
Browserless is more developer-friendly. It lets you connect Puppeteer or Playwright to a remote Chrome over WebSocket, so you can keep your code and just point it at the cloud.
import puppeteer from 'puppeteer-core'
const browser = await puppeteer.connect({
browserWSEndpoint: 'wss://chrome.browserless.io?token=YOUR_TOKEN',
})
const page = await browser.newPage()
await page.goto('https://example.com')
const html = await page.content()
await browser.disconnect()
Pricing mixes per-session and per-hour, starting around $50 a month. You shed the burden of running your own Chrome containers.
ZenRows is the bot-bypass specialist of the latecomers. It bundles JS rendering plus premium proxies plus AntiBot bypass plus CSS-selector extraction into a single API and competes head-on with ScrapingBee. In 2026 ZenRows added a Universal Scraper API with LLM-friendly output — Markdown, JSON-LD, main-content extraction — automated for you.
ScrapingAnt, ScraperAPI and Apify Web Scraper are direct competitors in the same category.
10. AI / LLM-Friendly — Firecrawl / Jina AI Reader / Diffbot
This is a new category that appeared in 2024. Services that return web content already converted into LLM-friendly Markdown. They are tuned for RAG indexing and LLM training-data collection.
Firecrawl is a 2024 YC company. Hand it a URL and it returns LLM-friendly Markdown rather than HTML. JS rendering, automatic pagination, full-site crawling and web search all live behind one API.
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"formats": ["markdown", "html"]
}'
The response includes a markdown field plus metadata (title, description, sourceURL and more). When building a RAG dataset you no longer need to strip ads, navigation and footers by hand with Cheerio.
Pricing starts at $0.83 per 1,000 pages (Standard tier). The crawl API recursively pulls an entire site and returns a pages array. In 2026 an extract API was added — give the LLM an extraction schema and you get back structured JSON.
Jina AI Reader is the most attractive free option. Just prefix the URL with https://r.jina.ai/.
curl https://r.jina.ai/https://example.com/article
The response is LLM-friendly Markdown and you can call it without an API key (subject to rate limits). Registering an API key gives higher limits and extra features.
Jina's edge is that its backend is its own trained Reader-LM — a small model dedicated to HTML-to-Markdown (see next section). It is immediately usable for LLM training-data construction, the web-search and read step of AI agents, and RAG indexing.
Diffbot is the veteran, in business since 2008. AI-based automatic extraction — give it any page and it auto-detects the page type (article, product, discussion, event, etc.) and returns structured JSON. Pricing is steep (starts at $299 per month), but you stop writing per-site extraction logic for thousands of sites. In 2024 it added an LLM-friendly output mode as well.
11. Reader-LM (Jina, May 2024) — A Small Model Dedicated to HTML to Markdown
In May 2024 Jina AI released two small models, Reader-LM, at 0.5B and 1.5B parameters. The input is a large HTML document and the output is LLM-friendly Markdown — and that one task is all it does.
Why a dedicated model? Rule-based extractors like regex or readability.js struggle with ads, SPA markup and dynamic content; general-purpose LLMs (GPT-4 / Claude) blow through their 256k context and cost explodes. Reader-LM fills the gap with a task-specific small language model.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("jinaai/reader-lm-1.5b")
model = AutoModelForCausalLM.from_pretrained("jinaai/reader-lm-1.5b")
html = open("page.html").read()
prompt = f"Convert the following HTML to clean Markdown:\n\n{html}"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096)
markdown = tokenizer.decode(outputs[0], skip_special_tokens=True)
In 2026 the successor Reader-LM-v2 ships with a 256k context and cleaner table and code-block extraction. It is available free on Hugging Face.
This model is also the backend of the Jina AI Reader API — every call to r.jina.ai triggers one inference pass through Reader-LM. You can self-host for the same output at zero per-call cost, but the decision comes down to whether GPU and ops cost beats per-call API fees.
12. AI Playwright — Stagehand / Browser Use / AnchorBrowser
A new category took off in late 2025: tools where AI drives the browser in natural language. They layer an LLM on top of Playwright so you can act on a page without writing selectors.
Stagehand is the open-source library from Browserbase. It adds three AI methods on top of the Playwright API: act, extract, observe.
import { Stagehand } from '@browserbasehq/stagehand'
const stagehand = new Stagehand({ env: 'LOCAL' })
await stagehand.init()
const page = stagehand.page
await page.goto('https://amazon.com')
await page.act({ action: 'search for headphones' })
await page.act({ action: 'click the first result' })
const product = await page.extract({
instruction: 'extract product title, price, and rating',
schema: z.object({
title: z.string(),
price: z.string(),
rating: z.number(),
}),
})
await stagehand.close()
Under the hood, Stagehand sends the page DOM to an LLM and asks decisions such as "Is this the search button?". When selectors change but intent does not, it still works. Browserbase is the managed cloud-browser service from the same company; you run Stagehand locally or in the cloud with the same code.
Browser Use is the Python competitor. Agent loop plus LLM plus Playwright — give it a natural-language goal (e.g. "fetch 10 Python backend jobs from Indeed") and the LLM clicks, scrolls and extracts as needed.
from browser_use import Agent
from langchain_openai import ChatOpenAI
agent = Agent(
task="Fetch the title and URL of 10 Python backend jobs from Indeed",
llm=ChatOpenAI(model="gpt-4o"),
)
result = await agent.run()
print(result)
AnchorBrowser is the more infrastructure-flavored option. Hand it Stagehand, Browser Use, or any existing Playwright code, and it runs the code on a cloud browser. A direct competitor to Browserbase.
This is the fastest-growing category in 2026. As AI agents (Claude, GPT) "see" and act on the web, selector-based scraping is steadily being replaced by LLM decisions plus visual grounding.
13. Parsing — Cheerio / Beautiful Soup 4 / lxml / jsoup / Goutte
Once you have the HTML, you parse it. Each language has a de facto standard.
Cheerio (Node) — jQuery-like API for HTML. Fast and light.
import * as cheerio from 'cheerio'
const html = await fetch('https://example.com').then((r) => r.text())
const $ = cheerio.load(html)
$('article.post').each((_i, el) => {
console.log($(el).find('h2').text(), $(el).find('a').attr('href'))
})
Beautiful Soup 4 (Python) — the most familiar Python HTML parser.
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("https://example.com").text, "lxml")
for post in soup.select("article.post"):
print(post.h2.text, post.a["href"])
bs4 supports an internal parser choice of lxml, html.parser, or html5lib. Use lxml when speed matters, html5lib when standards compliance matters.
lxml (Python) — C bindings for libxml2. Five to ten times faster than bs4, and supports XPath 1.0. Use it for bulk processing and XML.
from lxml import html
tree = html.fromstring(open("page.html").read())
titles = tree.xpath("//article[@class='post']/h2/text()")
jsoup (Java) — the standard on the JVM. Provides jQuery-style selectors similar to Cheerio.
Document doc = Jsoup.connect("https://example.com").get();
for (Element post : doc.select("article.post")) {
System.out.println(post.select("h2").text());
}
Goutte (PHP) — a scraper plus parser built by Symfony, bundling Symfony BrowserKit and DomCrawler. Around 2023 it was folded back into using Symfony BrowserKit and HttpClient directly, but the Goutte name is still in common use.
Mechanize (Python / Ruby) — a "programmable browser" specialized for form submission and cookie handling. No JS, but still useful for form-driven sites like courts, libraries, and government.
The selection rule is simple: follow your language's standard. The one caveat is Python at scale — prefer lxml directly over bs4.
14. Stealth — puppeteer-extra-stealth / undetected-chromedriver / Camoufox
Bot detection became a serious arms race in the 2020s. Cloudflare, DataDome, Akamai Bot Manager, and PerimeterX inspect signals such as:
navigator.webdriver === true(a headless-browser tell)- WebGL, Canvas, AudioContext fingerprints
- TLS fingerprint (JA3 / JA4)
- HTTP/2 fingerprint
- Mouse and keyboard event patterns (does the cursor move like a human?)
- Traces of CDP (Chrome DevTools Protocol)
Here are the tools that bypass them.
puppeteer-extra-plugin-stealth — the most-used stealth plugin for both Puppeteer and Playwright. It rolls up twenty-plus bypass techniques: removing navigator.webdriver, registering fake plugins, faking WebGL fingerprints, mocking the Chrome runtime, and more.
import puppeteer from 'puppeteer-extra'
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
puppeteer.use(StealthPlugin())
const browser = await puppeteer.launch({ headless: false })
undetected-chromedriver (Python) — the Selenium-side standard. It rebuilds the Chrome WebDriver binary to scrub CDP-detection signals. Widely considered one of the most effective tools against Cloudflare-protected sites.
import undetected_chromedriver as uc
driver = uc.Chrome(headless=False)
driver.get("https://cloudflare-protected-site.com")
Camoufox — a new stealth browser based on a Firefox fork. Released in 2024. It uses the same interfaces as Puppeteer / Playwright, but the browser engine itself is a custom build that scrubs bot-detection signals. It approaches Mullvad Browser-grade fingerprint resistance, and is considered one of the strongest bypass options today.
from camoufox.sync_api import Camoufox
with Camoufox(headless=False) as browser:
page = browser.new_page()
page.goto("https://example.com")
Proxies plus stealth plus humanlike timing are separate axes. You need all three to slip past about 80 percent of bot detection. That is exactly why managed services like ZenRows and Bright Data Web Unlocker exist — paying per-GB fees beats running all three yourself.
15. robots.txt + sitemap.xml + Crawl Ethics
Before you reach for tools, learn what is allowed.
robots.txt — the crawler policy at the site's root, /robots.txt. A social contract (not law) that every crawler should follow.
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Crawl-delay: 10
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
Crawl-delay advises a minimum interval (in seconds) between requests. Explicit LLM-bot blocks like User-agent: GPTBot exploded between 2023 and 2024; by 2026 most major sites block LLM-training crawlers in robots.txt.
sitemap.xml — the file in which a site explicitly lists its pages. It contains URL, lastmod, changefreq and priority. Crawling along the sitemap is much more efficient than BFS-style page discovery, and lighter on the host.
After IETF standardization in 2022 as the Robots Exclusion Protocol (REP), robots.txt was upgraded from "advisory" to "a standard you should comply with". Some US and EU rulings have treated ignoring robots.txt as unauthorized access.
Crawl ethics checklist:
- Read robots.txt and exclude blocked paths.
- Honor
Crawl-delay(default to one second or more if not set). - Set an identifying UA (
Mozilla/5.0 (compatible; YourBotName/1.0; +https://your-site.com/bot)). - Keep concurrent requests against the same domain to one to four.
- Honor cache headers (
If-Modified-Since,ETag). - Run legal / DPO review if collecting personal data or copyrighted content.
- Preserve source attribution in the collected data (especially for LLM training).
A bot that follows the checklist is rarely blocked. Bots that are blocked are usually too fast, with a suspicious UA, and ignore robots.txt.
16. Korea / Japan — Toss, Kakao, Mercari Scraping Policies
Scraping policy varies by region. Let us look at cases from Korea and Japan.
Korea — Toss / TossPayments: Toss explicitly forbids automated access to its services in its terms. In 2023 Toss took legal action when a fintech comparison service crawled Toss pages. In 2017 the Korean Supreme Court ruling in Saramin vs JobKorea set the precedent that "even public information can become a database-rights infringement when automated mass collection is involved". In other words, being public does not make a site safe to crawl.
Korea — Kakao / Daum: Daum Search runs a Kakao search bot that indexes Korean sites. Per Kakao's public statements, Daum's bot uses User-agent: Daum and strictly honors robots.txt. SEO in Korea means considering all three: Naver's bot (Yeti), Daum's bot (Daum), and Googlebot.
Japan — Mercari: Mercari Article 7.1 of its terms explicitly bans "data access through APIs or automated tools." In 2019 Mercari obtained an injunction in a Japanese court against a price-comparison service that scraped its data. Japan's Copyright Act, after the 2018 amendment, broadly permits "reproduction for information analysis" (Article 30-4), but site terms are interpreted separately.
Japan — Curation sites and the NAVER Matome case: NAVER Matome in Japan (shut down in 2020) was a curation service where users compiled content from other sites. After many copyright disputes, NAVER closed the service. Although it was user curation, not automated scraping, the absence of attribution and the high-volume reuse were what caused the dispute.
The takeaway from these three cases: public does not equal allowed. Always check the site's terms, robots.txt, and the database-rights and copyright law in the target country. The same applies all the more when the data feeds LLM training.
17. Who Should Pick What — Five Scenarios
With the tool map covered, here is the decision matrix.
Scenario 1: Student / side project — one site, 100 to 1,000 pages → Python plus Scrapy, or Playwright. Write the code yourself. Free. Maximum learning value.
Scenario 2: Indie hacker / SaaS data — 10 to 50 sites, daily refresh → Crawlee (Node) or Scrapy plus scrapy-playwright. Ship it to Apify and run on a cron, or your own VPS. For proxies, a Smartproxy or Bright Data datacenter GB package is plenty.
Scenario 3: Monitoring / price comparison — sites with strong bot detection → Crawlee plus puppeteer-extra-stealth plus a Bright Data residential proxy. Or a ZenRows / ScrapingBee API. Compare self-hosting vs SaaS cost before deciding.
Scenario 4: LLM training data / RAG — thousands of sites, Markdown output
→ Firecrawl's crawl API or Jina AI Reader. Both return LLM-friendly Markdown. If you are cost-sensitive, pick Jina; if you want more options, pick Firecrawl.
Scenario 5: AI agent driving the browser — no selectors, please → Stagehand (TypeScript) plus Browserbase, or Browser Use (Python). Note the LLM API cost on top.
Cost intuition (per 1,000 pages):
- Self-hosted (server plus proxy GB): 10
- Apify Actor (marketplace): 5
- ScrapingBee / ZenRows API: 20
- Bright Data Web Unlocker: 10
- Firecrawl: 3
- Jina AI Reader paid: about $0.5
- Jina AI Reader free tier: 20–200 requests per minute
- Stagehand plus Browserbase: 50 (LLM API cost included)
Price is the inverse of ops burden. Self-hosting is cheaper but costs your time.
18. References
- Scrapy documentation —
https://docs.scrapy.org/ - Scrapy GitHub —
https://github.com/scrapy/scrapy - Playwright Python —
https://playwright.dev/python/ - Playwright Node —
https://playwright.dev/ - Playwright MCP —
https://github.com/microsoft/playwright-mcp - Puppeteer documentation —
https://pptr.dev/ - puppeteer-extra-plugin-stealth —
https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth - Selenium documentation —
https://www.selenium.dev/documentation/ - WebDriver BiDi specification —
https://w3c.github.io/webdriver-bidi/ - undetected-chromedriver —
https://github.com/ultrafunkamsterdam/undetected-chromedriver - Crawlee for Node —
https://crawlee.dev/ - Crawlee for Python —
https://crawlee.dev/python/ - Apify platform —
https://apify.com/ - Apify Store —
https://apify.com/store - Bright Data —
https://brightdata.com/ - Oxylabs —
https://oxylabs.io/ - Smartproxy / Decodo —
https://decodo.com/ - ScrapingBee —
https://www.scrapingbee.com/ - Browserless —
https://www.browserless.io/ - ZenRows —
https://www.zenrows.com/ - ScrapingAnt —
https://scrapingant.com/ - Firecrawl —
https://www.firecrawl.dev/ - Firecrawl GitHub —
https://github.com/mendableai/firecrawl - Jina AI Reader —
https://jina.ai/reader/ - Reader-LM model card —
https://huggingface.co/jinaai/reader-lm-1.5b - Diffbot —
https://www.diffbot.com/ - Stagehand —
https://www.stagehand.dev/ - Stagehand GitHub —
https://github.com/browserbase/stagehand - Browserbase —
https://www.browserbase.com/ - Browser Use —
https://github.com/browser-use/browser-use - AnchorBrowser —
https://anchorbrowser.io/ - Camoufox —
https://github.com/daijro/camoufox - Cheerio —
https://cheerio.js.org/ - Beautiful Soup —
https://www.crummy.com/software/BeautifulSoup/ - lxml —
https://lxml.de/ - jsoup —
https://jsoup.org/ - Goutte —
https://github.com/FriendsOfPHP/Goutte - Mechanize (Python) —
https://github.com/python-mechanize/mechanize - Robots Exclusion Protocol (RFC 9309) —
https://www.rfc-editor.org/rfc/rfc9309.html - Sitemaps XML format —
https://www.sitemaps.org/protocol.html - Common Crawl —
https://commoncrawl.org/ - TossPayments terms —
https://docs.tosspayments.com/ - Japan Copyright Act Article 30-4 —
https://www.bunka.go.jp/seisaku/chosakuken/ - Mercari Terms of Service —
https://www.mercari.com/jp/tos/