Cloudflare Crawl Endpoint

Posted by jeffpalmer 3 hours ago

Comments

Comment by greatgib 1 hour ago

All what was expected, first they do a huge campaign to out evil scrapers. We should use their service to ensure your website block llm and bots to come scraping them. Look how bad it is.

And once that is setup, and you have you walled garden, then you can present your own api to scrape website. All well done to be used by your llm. But as you know, we are the gate keeper so that the Mafia boss decide what will be the "intermediate" that is proper for itself to let you do what you were doing without intermediary before.

Comment by shadowfiend 1 hour ago

No: https://developers.cloudflare.com/browser-rendering/rest-api...

Comment by x0x0 27 minutes ago

most websites, particularly those behind cloudflare, are very restrictive even to crawlers that obey robots. Proof: a ton of my time over the last year, and my crawlers very carefully obey robots.

It's hard to see how this isn't extorting folks by offering a working solution that, oh, cloudflare doesn't block. As long as you pay Cloudflare.

Perhaps I'm overly cynical, but I'd be quite surprised if cloudflare subjected their own headless browsing to the same rules the rest of the internet gets.

Comment by gruez 16 minutes ago

>most websites, particularly those behind cloudflare, are very restrictive even to crawlers that obey robots. Proof: a ton of my time over the last year, and my crawlers very carefully obey robots.

The docs are pretty equivocal though:

>If you use Cloudflare products that control or restrict bot traffic such as Bot Management, Web Application Firewall (WAF), or Turnstile, the same rules will apply to the Browser Rendering crawler.

It's not just robots.txt. Most (all?) restrictions that apply to outside bots apply to cloudflare's bot as well, at least that's what they're claiming. If they're being this explicit about it, I'm willing to give them the benefit of the doubt until there's evidence to the contrary, rather than being a cynic and assuming the worst.

Comment by arjunchint 2 minutes ago

RIP @FireCrawl or at the very least they were the inspiration for this?

Comment by jasongill 2 hours ago

I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json They already have the website content in their cache, so why not just cut out the middle man of scraping services and API's like this and publish it?

Obviously there's good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.

Comment by michaelmior 1 hour ago

> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

Comment by binarymax 1 hour ago

Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. Probably overkill for lots and lots of sites. An HTML fetch + readability is really cheap.

Comment by janalsncm 59 minutes ago

How would they know the content hasn’t changed without hitting the website?

Comment by selcuka 1 hour ago

Not the same thing, but they have something close (it's not on-by-default, yet) [1]:

> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.

[1] https://blog.cloudflare.com/markdown-for-agents/

Comment by cmsparks 1 hour ago

That would prolly work for simple sites, but you still need the dedicated scraping service with a browser to render sites that are more complex (i.e. SPAs)

Comment by csomar 1 hour ago

It’s a bit more complicated than that. This is their product Browser Rendering, which runs a real browser that loads the page and executes JavaScript. It’s a bit more involved than a simple curl scraping.

Comment by ljm 1 hour ago

Is cloudflare becoming a mob outfit? Because they are selling scraping countermeasures but are now selling scraping too.

And they can pull it off because of their reach over the internet with the free DNS.

Comment by shadowfiend 1 hour ago

No: https://developers.cloudflare.com/browser-rendering/rest-api...

Comment by iso-logi 1 hour ago

Their free DNS is only a small piece of the pie.

The fact that 30%+ of the web relies on their caching services, routablility services and DDoS protection services is the main pull.

Their DNS is only really for data collection and to front as "good will"

Comment by subscribed 28 minutes ago

I think there's some space being absolutely snuffed by the countless bots of everyone, ignoring everything, pulling from residential proxies, and this, supposedly slower, well behavior, smarter bot.

Like there's a difference between dozens of drunk teenagers thrashing the city streets in the illegal street race vs a taxi driver.

Comment by theamk 1 hour ago

no? it takes 10 seconds to check:

> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".

You don't need any scraping countermeasures for crawlers like those.

Comment by Macha 17 minutes ago

So what’s the user agent for their bot? They don’t seem to specify the default in the docs and it looks like it’s user configurable. So yet another opt out bot which you need your web server to match on special behaviour to block

Comment by gruez 13 minutes ago

>So yet another opt out bot which you need your web server to match on special behaviour to block

Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.

Comment by its-kostya 1 hour ago

Cloudflare has been trying to mediate publishers & AI companies. If publishers are behind Cloudflare and Cloudflare's bot detection stops scrapers at the request of publishers, the publishers can allow their data to be scraped (via this end point) for a price. It creates market scarcity. I don't believe the target audience is you and me. Unless you own a very popular blog that AI companies would pay you for.

Comment by giancarlostoro 59 minutes ago

If they ever sell or the CEO shifts, yes. For the meantime, they have not given any strong indication that they're trying to bully anybody. I could see things changing drastically if the people in charge are swapped out.

Comment by rrr_oh_man 1 hour ago

It’s a three letter agency front.

Comment by stri8ted 1 hour ago

Do you have any evidence to support this view?

Comment by rolymath 31 minutes ago

Read who and how it was founded. It's not a secret at all.

Comment by mtmail 1 hour ago

Any kind of source for the claim?

Comment by Retr0id 1 hour ago

For a long time cloudflare has proudly protected DDoS-as-a-service sites (but of course, they claim they don't "host" them)

Comment by david_iqlabs 23 minutes ago

Crawling pages is actually the easy part. The harder problem is turning the crawl data into something useful. I’ve found the real work is extracting clear signals from the mess of HTML and responses.

Comment by everfrustrated 1 hour ago

Will this crawler be run behind or infront of their bot blocker logic?

Comment by shadowfiend 1 hour ago

In front: https://developers.cloudflare.com/browser-rendering/rest-api...

Comment by arjie 24 minutes ago

Oh man, I was hoping I could offer a nicely-crawled version of my site. It would be cool if they offered that for site admins. Then everyone who wanted to crawl would just get a thing they could get for pure transfer cost. I suppose I could build one by submitting a crawl job against myself and then offering a `static.` subdomain on each thing that people could access. Then it's pure HTML instant-load.

Comment by echoangle 9 minutes ago

I don’t really get the usecase. Is your site static? Then you should just render it to html files and host the static files. And if it’s not static, how would a snapshot of the pages help if they change later? And also why not just add some caching to the site then?

Comment by devnotes77 1 hour ago

Worth noting: origin owners can still detect and block CF Browser Rendering requests if needed.

Workers-originated requests include a CF-Worker header identifying the workers subdomain, which distinguishes them from regular CDN proxying. You can match on this in a WAF rule or origin middleware.

The trickier issue: rendered requests originate from Cloudflare ASN 13335 with a low bot score, so if you rely on CF bot scores for content protection, requests through their own crawl product will bypass that check. The practical defense is application-layer rate limiting and behavioral analysis rather than network-level scores -- which is better practice regardless.

The structural conflict is real but similar to search engines offering webmaster tools while running the index. The incentives are misaligned, but the individual products have independent utility. The harder question is whether the combination makes it meaningfully harder to build effective bot protection on top of their platform.

Comment by radium3d 1 hour ago

Instead of "should have been an email" this is "should have been a prompt" and can be run locally instead. There are a number of ways to do this from a linux terminal.

``` write a custom crawler that will crawl every page on a site (internal links to the original domain only, scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Make it designed to run locally in a terminal on a linux machine using headless Google Chrome and take advantage of multiple cores to run multiple pages simultaneously while keeping in mind that it might have to throttle if the server gets hit too fast from the same IP. ```

Might use available open source software such as python, playwright, beautifulsoup4, pillow, aiofiles, trafilatura

Comment by Normal_gaussian 31 minutes ago

This presumably is going to be cheap and effective. Its much easier to wrap a prompt round this and know it works that mess around with crawling it all yourself.

You'll still be hand-rolling it if you want to disrespect crawling requirements though.

Comment by supermdguy 17 minutes ago

I’ve actually written a crawler like that before, and still ended up going with Firecrawl for a more recent project. There’s just so many headaches at scale: OOMs from heavy pages, proxies for sites that block cloud IPs, handling nested iframes, etc.

Comment by Normal_gaussian 21 minutes ago

"Well-behaved bot - Honors robots.txt directives, including crawl-delay"

From the behaviour of our peers, this seems to be the real headline news.

Comment by jppope 1 hour ago

This is actually really amazing. Cloudflare is just skating to where the puck is going to be on this one.

Comment by pupppet 1 hour ago

Cloudflare getting all the cool toys. AWS, anyone awake over there?

Comment by skybrian 32 minutes ago

If two customers crawl the same website and it uses crawl-delay, how does it handle that? Are they independent, or does each one run half as fast?

Comment by patchnull 1 hour ago

The main win here is abstracting away browser context lifecycle management. Anyone who has run Puppeteer on Workers knows the pain of handling cold starts, context reuse, and timeout cascading across navigation steps. Having crawl() bundle render-then-extract into one call covers maybe 80% of scraping use cases. The remaining 20% where you need request interception or pre-render script injection still needs the full Browser Rendering API, but for pulling structured data from public pages this is a big simplification over managing session state yourself.

Comment by binarymax 1 hour ago

Really hard to understand costs here. What is a reasonable pages per second? Should I assume with politeness that I'm basically at 1 page per second == 3600 pages/hour? Seems painfully slow.

Comment by triwats 2 hours ago

this could be cool to use cloudflare's edge to do some monitoring of endpoints actual content for synthetic monitoring

Comment by 1 hour ago

Comment by devnotes77 1 hour ago

To clarify the two questions raised:

First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.

Second on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.

The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.

Comment by babelfish 1 hour ago

Didn't they just throw a (very public) fit over Perplexity doing the exact same thing?

Comment by 8cvor6j844qw_d6 2 hours ago

Does this bypass their own anti-AI crawl measures?

I'll need to test it out, especially with the labyrinth.

Comment by jsheard 1 hour ago

They say it doesn't: https://developers.cloudflare.com/browser-rendering/faq/#wil...

Further down they also mention that the requests come from CFs ASN and are branded with identifying headers, so third party filters could easily block them too if they're so inclined. Seems reasonable enough.

Comment by xhcuvuvyc 2 hours ago

Yeah, that'd be huge, like 90% of my search engine results are just cloudflare bot checks if I don't filter it out.

Comment by mdasen 1 hour ago

If this does bypass their own (and others') anti-AI crawl measures, it'd basically mean that the only people who can't crawl are those without money.

We're creating an internet that is becoming self-reinforcing for those who already have power and harder for anyone else. As crawling becomes difficult and expensive, only those with previously collected datasets get to play. I certainly understand individual sites wanting to limit access, but it seems unlikely that they're limiting access to the big players - and maybe even helping them since others won't be able to compete as well.

Comment by adi_kurian 1 hour ago

Common Crawl has free egress

Comment by canpan 1 hour ago

I feel there is a conflict of interest here..

I'm split between: Yes! At last something to get CF protected sites! And: Uh! Now the internet is successfully centralized.

Comment by memothon 2 hours ago

I've used browser rendering at work and it's quite nice. Most solutions in the crawling space are kind of scummy and designed for side-stepping robots.txt and not being a good citizen. A crawl endpoint is a very necessary addition!

Comment by rvz 1 hour ago

Selling the cure (DDoS protection) and creating the poison (Authorized AI crawling) against their customers.

Comment by Imustaskforhelp 1 hour ago

This might be really great!

I had the idea after buying https://mirror.forum recently (which I talked in discord and archiveteam irc servers) that I wanted to preserve/mirror forums (especially tech) related [Think TinyCoreLinux] since Archive.org is really really great but I would prefer some other efforts as well within this space.

I didn't want to scrape/crawl it myself because I felt like it would feel like yet another scraping effort for AI and strain resources of developers.

And even when you want to crawl, the issue is that you can't crawl cloudflare and sometimes for good measure.

So in my understanding, can I use Cloudflare Crawl to essentially crawl the whole website of a forum and does this only work for forums which use cloudflare ?

Also what is the pricing of this? Is it just a standard cloudflare worker so would I get free 100k requests and 1 Million per the few cents (IIRC) offer for crawling. Considering that Cloudflare is very scalable, It might even make sense more than buying a group of cheap VPS's

Also another point but I was previously thinking that the best way was probably if maintainers of these forums could give me a backup archive of the forum in a periodic manner as my heart believes it to be most cleanest way and discussing it on Linux discord servers and archivers within that community and in general, I couldn't find anyone who maintains such tech forums who can subscribe to the idea of sharing the forum's public data as a quick backup for preservation purposes. So if anyone knows or maintains any forums myself. Feel free to message here in this thread about that too.

Comment by ipaddr 1 hour ago

"I didn't want to scrape/crawl it myself because I felt like it would feel like yet another scraping effort for AI and strain resources of developers"

You feel better paying someone to do the same thimg?

Comment by Imustaskforhelp 1 hour ago

I actually don't but it seems that cloudflare caches responses so if anything instead of straining the developer resources, it would strain more cloudflare resources and cloudflare could better handle that more efficiently with their own crawl product.

Also, I am genuinely open to feedback (Like a lot) so just let me know if you know of any other alternative too for the particular thing that I wish to create and I would love to have a discussion about that too! I genuinely wish that there can be other ways and part of the reason why I wrote that comment was wishing that someone who manages forums or knows people who do can comment back and we can have a discussion/something-meaningful!

I am also happy with you also suggesting me any good use cases of the domain in general if there can be made anything useful with it. In fact, I am happy with transferring this domain to you if this is something which is useful to ya or anyone here (Just donate some money preferably 50-100$ to any great charity in date after this comment is made and mail me details and I am absolutely willing to transfer the domain, or if you work in any charity currently and if it could help the charity in any meaningful manner!)

I had actually asked archive team if I could donate the domain to them if it would help archive.org in any meaningful way and they essentially politely declined.

I just bought this domain because someone on HN said mirror.org when they wanted to show someone else mirror and saw the price of the .org domain being so high (150k$ or similar)and I have habit of finding random nice TLD and I found mirror.forum so I bought it

And I was just thinking of hmm what can be a decent idea now that I have bought it and had thought of that. Obviously I have my flaws (many actually) but I genuinely don't wish any harm to anybody especially those people who are passionate about running independent forums in this centralized-web. I'd rather have this domain be expired if its activation meant harm to anybody.

looking forward to discussion with ya.