Blocking Crawls From Cloudflare's Browser Crawl Endpoint
Earlier this week, Cloudflare announced the introduction of their Browser Crawl Endpoint.
This allows Cloudflare users to crawl an entire website by making a _single_ API call to the Browser rendering service.
Although the Browser Rendering service honours robots.txt, Cloudflare don't define a specific User-Agent that the service will check for, apparently instead expecting website operators to disallow **all** user agents if they want to keep Cloudflare out.
However, they have also documented that the service includes Cloudflare specific request headers, allowing requests to be blocked by checking for those.
This post details how to achieve that on BunnyCDN, Nginx and OpenResty.
* * *
### The Headers
The relevant header _names_ are documented here. However, unhelpfully, Cloudflare have not provided example/expected values, so I had to go digging.
`cf-brapi-request-id` contains a unique request ID, so although you can check for its existence, relying on the value being in a consistent format may be unwise.
`Signature-agent` is a little bit more useful. The automatic request headers documentation indicates that the value will point to a path under `https://web-bot-auth.cloudflare-browser-rendering-085.workers.dev/`. It is, however, unclear whether this will always be the case (the inclusion of a number suggests that it may not).
* * *
### BunnyCDN
BunnyCDN allows the creation of edge rules which can match against request headers.
Although they don't provide an explicit way to test for the existence of a header, their glob support allows us to achieve the same effect:
```
Action: Block Request
Conditions: Match Any

  Request Header
    Header Name: Signature-agent
    Value: https://web-bot-auth.cloudflare-browser-rendering*

  Request Header
    Header Name: cf-brapi-request-id
    Value: *
```
* * *
### Nginx
Requests can also be blocked in Nginx:
```nginx
if ($http_signature_agent ~ "^https://web-bot-auth\.cloudflare-browser-rendering") {
    return 403;
}

if ($http_cf_brapi_request_id) {
    return 403;
}
```
Note: although _if is evil_, using `return` inside `if` is one of the constructs that's considered 100% safe.
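If you'd rather keep the matching logic out of `if` entirely, the checks can instead be expressed as `map` blocks (in the `http` context) that each set a flag, leaving only a simple `return` guard in the `server`/`location` block. This is a sketch; the variable names are my own:

```nginx
# Flag requests whose Signature-agent points at the browser-rendering worker
map $http_signature_agent $block_cf_renderer {
    default                                                0;
    "~^https://web-bot-auth\.cloudflare-browser-rendering" 1;
}

# Flag requests carrying cf-brapi-request-id with any value
map $http_cf_brapi_request_id $block_cf_brapi {
    default 1;   # header present, any value
    ""      0;   # header absent
}
```

Then, inside the relevant `server` or `location` block:

```nginx
if ($block_cf_renderer) { return 403; }
if ($block_cf_brapi)    { return 403; }
```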
* * *
### OpenResty
If you're using OpenResty you can still use the Nginx config above, but you can also achieve the same in Lua:
```lua
local h = ngx.req.get_headers()

if h["cf-brapi-request-id"] then
    return ngx.exit(403)
end

-- block if Signature-agent points at the browser-rendering worker
-- (plain-text prefix match rather than a glob comparison)
if h["signature-agent"] and string.find(h["signature-agent"], "https://web-bot-auth.cloudflare-browser-rendering", 1, true) == 1 then
    return ngx.exit(403)
end
```
This snippet can easily be included in a `header_filter_by_lua` block, with custom response headers added for debugging purposes:
```nginx
header_filter_by_lua '
    local h = ngx.req.get_headers()

    if h["cf-brapi-request-id"] then
        ngx.header["x-reason"] = "Foxtrot Oscar my old buddy"
        return ngx.exit(403)
    end

    if h["signature-agent"] and string.find(h["signature-agent"], "https://web-bot-auth.cloudflare-browser-rendering", 1, true) == 1 then
        ngx.header["x-reason"] = "Sign this..."
        return ngx.exit(403)
    end
';
```
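One thing worth noting: header filters run late, after Nginx already has a response to send, so by that point the request may already have hit your backend. If the goal is to reject these requests before any upstream work happens, the same Lua can run in the access phase instead. A sketch, assuming a standard OpenResty setup:

```nginx
location / {
    # Runs before the request is passed upstream, unlike header filters
    access_by_lua_block {
        local h = ngx.req.get_headers()

        if h["cf-brapi-request-id"] then
            return ngx.exit(403)
        end

        if h["signature-agent"] and string.find(h["signature-agent"], "https://web-bot-auth.cloudflare-browser-rendering", 1, true) == 1 then
            return ngx.exit(403)
        end
    }

    # normal handling (e.g. proxy_pass) continues for unblocked requests
}
```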
* * *
### Conclusion
I already have more than enough unwanted traffic hitting my servers without Cloudflare giving others an off-the-shelf ability to one-shot my services.
To give Cloudflare their dues, though, they have at least documented how to block their browser rendering service. It could _perhaps_ have been more clearly documented, but the information is at least there.
Still, it would have been nice if they could have defined a _specific_ user-agent to be added to `robots.txt` rather than expecting people to check headers on every request.
Author: Ben Tasker
www.bentasker.co.uk/posts/documentation/gene...
#bots #bunnycdn #cloudflare #nginx #openresty