Horacio Degiorgi

Blocking bots in Nginx

At bibliotecas.uncuyo.edu.ar we run multiple services behind an nginx-based reverse proxy.
For several days all of the systems had been slowing down. Analyzing the access logs, we found a massive increase in "visits" from AI bots.
How do we block them?
By adding rules to the proxy_hosts definitions:

```nginx
if ($http_user_agent ~* "amazonbot|Claudebot|claudebot|DataForSeoBot|dataforseobot|Amazonbot|SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|MojeekBot|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot") {
    return 403;
}
```
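If you can edit the main nginx configuration rather than a single proxy host, a `map` at the `http` level is a cleaner way to express the same rule. This is a minimal sketch with an abbreviated pattern (the variable name `$blocked_bot` is our own choice); extend the regex with the full list above:

```nginx
# Evaluate the User-Agent once, at the http level.
# The pattern here is abbreviated; paste in the full list above.
map $http_user_agent $blocked_bot {
    default 0;
    "~*(amazonbot|claudebot|semrushbot|ahrefsbot|mj12bot|yandex)" 1;
}

server {
    # ... existing proxy configuration ...

    # Reject flagged clients before they reach the backend.
    if ($blocked_bot) {
        return 403;
    }
}
```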

In our case, since we use proxymanager to manage the different domains, this configuration goes into the advanced section of each proxy host.

Advanced configuration section in proxymanager (screenshot).
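To check that the rule is active, you can spoof one of the blocked User-Agents with curl. The first request should come back with a 403; the second, with a normal browser User-Agent, should not:

```bash
# Pretend to be a blocked bot: expect a 403 in the response status line
curl -I -A "ClaudeBot" https://bibliotecas.uncuyo.edu.ar/

# A regular browser User-Agent should still be served
curl -I -A "Mozilla/5.0" https://bibliotecas.uncuyo.edu.ar/
```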

Top comments (1)

MUHAMMED YAZEEN AN

Great article! Blocking bots using User-Agent strings is a good starting point, and you've explained it really well.

I just wanted to add that User-Agent blocking can sometimes be bypassed since the User-Agent header can be easily spoofed. To make bot blocking more robust, we could combine it with other techniques like:

  • Rate limiting: Restrict the number of requests a client can make in a short time (see the sketch after this list).
  • IP blocking: Block known malicious IPs or ranges.
  • Behavior-based detection: Identify bots by analyzing unusual patterns like high request rates, skipping resources, or accessing non-existent pages.
  • JavaScript challenges: Verify that the client can execute JavaScript, as most bots cannot.
  • CAPTCHAs: Add a CAPTCHA to sensitive areas like login pages or forms.
  • Managed services: Services like Cloudflare or AWS WAF can provide more comprehensive bot protection.

Combining these techniques can help create a stronger defense against bots. Thanks again for sharing this; it's a great resource for anyone looking to get started!
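For the rate-limiting point in particular, nginx ships with the `limit_req` module; here is a minimal sketch (the zone name, rate, and burst values are illustrative, not tuned for any real site):

```nginx
# Define a shared zone keyed by client IP at the http level.
# 10 MB of shared memory holds roughly 160,000 addresses.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # Allow short bursts of up to 20 extra requests, then reject.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        # ... existing proxy_pass configuration ...
    }
}
```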
