Horacio Degiorgi

Blocking bots in Nginx

At bibliotecas.uncuyo.edu.ar we run multiple services behind an nginx-based reverse proxy.
For several days all of the systems had been slowing down. Analyzing the access logs, we found a massive increase in "visits" from AI bots.
How do we block them?
By adding rules to the proxy_hosts definitions:

```nginx
if ($http_user_agent ~* "amazonbot|Claudebot|claudebot|DataForSeoBot|dataforseobot|Amazonbot|SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|MojeekBot|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot") {
    return 403;
}
```
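If you can edit the main nginx configuration rather than a single proxy host, a `map` at the `http` level is a cleaner way to express the same rule. This is a minimal sketch with an abbreviated pattern (the variable name `$blocked_bot` is our own choice); extend the regex with the full list above:

```nginx
# Evaluate the User-Agent once, at the http level.
# The pattern here is abbreviated; paste in the full list above.
map $http_user_agent $blocked_bot {
    default 0;
    "~*(amazonbot|claudebot|semrushbot|ahrefsbot|mj12bot|yandex)" 1;
}

server {
    # ... existing proxy configuration ...

    # Reject flagged clients before they reach the backend.
    if ($blocked_bot) {
        return 403;
    }
}
```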

In our case, since we use proxymanager to manage the different domains, this configuration goes into the advanced section of each proxy host.

Advanced configuration section in proxymanager (screenshot).
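To check that the rule is active, you can spoof one of the blocked User-Agents with curl. The first request should come back with a 403; the second, with a normal browser User-Agent, should not:

```bash
# Pretend to be a blocked bot: expect a 403 in the response status line
curl -I -A "ClaudeBot" https://bibliotecas.uncuyo.edu.ar/

# A regular browser User-Agent should still be served
curl -I -A "Mozilla/5.0" https://bibliotecas.uncuyo.edu.ar/
```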

Top comments (1)

MUHAMMED YAZEEN AN

Great article! Blocking bots using User-Agent strings is a good starting point, and you've explained it really well.

I just wanted to add that User-Agent blocking can sometimes be bypassed since the User-Agent header can be easily spoofed. To make bot blocking more robust, we could combine it with other techniques like:

  • Rate limiting: Restrict the number of requests a client can make in a short time (see the sketch after this list).
  • IP blocking: Block known malicious IPs or ranges.
  • Behavior-based detection: Identify bots by analyzing unusual patterns like high request rates, skipping resources, or accessing non-existent pages.
  • JavaScript challenges: Verify that the client can execute JavaScript, as most bots cannot.
  • CAPTCHAs: Add a CAPTCHA to sensitive areas like login pages or forms.
  • Managed services: Services like Cloudflare or AWS WAF can provide more comprehensive bot protection.

Combining these techniques can help create a stronger defense against bots. Thanks again for sharing this; it's a great resource for anyone looking to get started!
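For the rate-limiting point in particular, nginx ships with the `limit_req` module; here is a minimal sketch (the zone name, rate, and burst values are illustrative, not tuned for any real site):

```nginx
# Define a shared zone keyed by client IP at the http level.
# 10 MB of shared memory holds roughly 160,000 addresses.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts of up to 20 extra requests, then reject.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        # ... existing proxy_pass configuration ...
    }
}
```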
