• drkt@scribe.disroot.org · 32 points · 9 days ago

    I am currently watching several malicious crawlers be stuck in a 404 hole I created. Check it out yourself at https://drkt.eu/asdfasd

    I respond to all 404s with a 200 and then serve that page, which is full of juicy bot targets. A lot of bots can’t get out of it, and I’m hoping the drive-by bots that look for login pages mark it as a hit (because it responded with 200 instead of 404), so a real human has to go check it and waste their time.
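
    For reference, here is a minimal sketch of how that trick could be wired up in nginx (paths and names are placeholders, not the actual drkt.eu setup): any URI that would normally 404 gets internally redirected to a static trap page and returned with a 200.

        server {
            listen 80;
            server_name example.org;
            root /var/www/site;

            # turn every would-be 404 into a 200 serving the trap page
            error_page 404 =200 /trap.html;

            location = /trap.html {
                # static page stuffed with fake links and login forms for crawlers to chase
                root /var/www/trap;
            }
        }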

    • Daniel Quinn@lemmy.ca · 7 points · 9 days ago

      This is pretty slick, but doesn’t this just mean the bots hammer your server, looping forever? How much processing do you do of those forms, for example?

      • drkt@scribe.disroot.org · 7 points · 9 days ago

        doesn’t this just mean the bots hammer your server looping forever?

        Yes

        How much processing do you do of those forms

        None

        It costs me nothing to have bots spending bandwidth on me because I’m not on a metered connection, and electricity is cheap enough that the tiny overhead of processing their requests might amount to a dollar or two per year.

      • jagged_circle@feddit.nl · 4 points · 9 days ago

        The best option is to redirect them to a 1 TB file served from Hetzner’s cache. There are some nginx configs floating around that do this.
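
        Something along these lines, roughly (the user-agent list and the target URL are placeholders; point it at whatever large, externally hosted file you like):

            # crude sketch: bounce matched crawlers to a huge file hosted elsewhere,
            # so the bandwidth bill lands on their side
            if ($http_user_agent ~* "GPTBot|ClaudeBot|Bytespider|Amazonbot") {
                return 302 https://speedtest.example.net/1TB.bin;
            }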

  • r00ty@kbin.life · 17 points · 10 days ago

    If you’re running nginx, this is what I use:

    if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }

    That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!

    I also have a huuuuuge IP-based block list (generated by adding all ranges returned from looking up the following AS numbers):

    AS45102 (Alibaba Cloud), AS136907 (Huawei SG), AS132203 (Tencent), AS32934 (Facebook)

    These companies run, or have run, bots that impersonate real browser user agents.

    There are various tools online that return prefix/IP lists for an autonomous system number.

    I put both into a single file and include it in my website config files.
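
    As a sketch, the included file is just one nginx deny rule per prefix (the prefixes below are documentation placeholders; generate the real list from the announced prefixes of the AS numbers above):

        # /etc/nginx/asn-blocklist.conf
        # pulled into each site config with:  include /etc/nginx/asn-blocklist.conf;
        deny 203.0.113.0/24;
        deny 198.51.100.0/22;
        deny 2001:db8::/32;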

    EDIT: Just to add, keeping on top of this is a full-time job!
    EDIT 2: Removed the Mojeek bot, as it seems to be a normal web crawler.

    • ctag@lemmy.sdf.org (OP) · 5 points · 9 days ago

      Thank you for the detailed reply.

      keeping on top of this is a full time job!

      I guess that’s why I’m interested in a tooling-based solution. My self-hosting is small-fry junk, but a lot of others like me are hosting entire fedi communities or larger websites.

      • r00ty@kbin.life · 5 points · 9 days ago

        Yeah, I probably should look and see if there are any good plugins that do this on a community-submission basis, because yes, it’s a pain to keep up with whatever trick they’re pulling next.

        And unlike web crawlers, which generally check a URL here and there, AI bots absolutely rip through your sites like something rabid.

        • Admiral Patrick@dubvee.org · 3 points · 9 days ago

          AI bots absolutely rip through your sites like something rabid.

          SemrushBot being the most rabid from my experience. Just will not take “fuck off” as an answer.

          That looks pretty much like how I’m doing it, also as an include for each virtual host. The only difference is I don’t even bother with a 403. I just use Nginx’s 444 “response” to immediately close the connection.
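
          Something along these lines, using the same user-agent matching style as the config above, just with nginx’s non-standard 444 instead of a 403 (the pattern list here is illustrative):

              if ($http_user_agent ~* "SemrushBot|AhrefsBot|Bytespider|ClaudeBot") {
                  return 444;  # nginx-specific: close the connection without sending any response
              }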

          Are you doing the IP blocks in Nginx as well, or lower down at the firewall level? Currently I’m doing it at the firewall level, since many of those hosts will also attempt SSH brute-forcing (good luck, since I only use keys, but still…)

          • r00ty@kbin.life · 4 points · 9 days ago

            My mbin instance is behind Cloudflare, so I filter the AS numbers there; they don’t even reach my server.

            On the sites that aren’t behind Cloudflare, yep, it’s at the nginx level. I did consider doing it at the firewall level, maybe with a dedicated chain for it, but since I was already blocking in nginx I just did it there for now. It keeps them off the content, but yes, it does tell them there’s a website there to leech from if they change their tactics, for example.

            You need to block the whole ASN too. The ones using Chrome/Firefox UAs change IP every 5 minutes, picking a random other address from their huuuuuge pools.

    • Atemu@lemmy.ml · 2 points · 9 days ago

      I’d suspect the bots would just try again with a masked user agent when they receive a 403.

      I think the best strategy would be to feed the bots shit that looks like real content.

  • nothacking@discuss.tchncs.de · 9 up, 1 down · 10 days ago

    Perhaps feed them convincing fake data so they don’t realize they’ve been IP-banned / user-agent filtered.
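
    One way to sketch that in nginx (patterns and paths are placeholders): instead of returning a 403, silently rewrite matched crawlers to a static decoy page, so from their side everything still looks like a normal 200.

        # inside the server block
        if ($http_user_agent ~* "Bytespider|ClaudeBot|GPTBot") {
            rewrite ^ /decoy.html last;
        }

        location = /decoy.html {
            # plausible-looking but worthless content
            root /var/www/decoy;
        }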

  • Deckweiss@lemmy.world · 8 up, 1 down · edited · 9 days ago

    The only way I can think of is blacklisting everything by default, redirecting to a proper, challenging captcha (which can be self-hosted), and temporarily whitelisting proven-human IPs.

    When you try to “enumerate badness” and block all AI user agents and IP ranges, you’ll always let some new ones through and you’ll never be done adding them.

    Only allow proven humans.


    A captcha will inconvenience users. If you just want to make things worse for the crawlers, let them spend compute resources through something like https://altcha.org/ (which would still allow them to crawl your site, but would make DDoSing very expensive) or AI honeypots.
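
    A very rough sketch of the deny-by-default idea in nginx (the challenge service itself is out of scope here; assume it appends each verified IP to the included file):

        # http context: map client IPs to a "verified" flag
        geo $verified_human {
            default 0;
            include /etc/nginx/verified-ips.conf;  # one "198.51.100.7 1;" line per proven human
        }

        server {
            listen 80;
            root /var/www/site;

            # the challenge page itself stays reachable for everyone
            location /challenge/ {
                try_files $uri $uri/ =404;
            }

            location / {
                if ($verified_human = 0) {
                    return 302 /challenge/;
                }
                try_files $uri $uri/ =404;
            }
        }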

  • Scrubbles@poptalk.scrubbles.tech · 6 points · 10 days ago

    If I’m reading your link right, they are using identifiable user agents. Granted, there are a lot of them. Maybe you could whitelist the user agents you approve of? Or one of the commenters posted a list that you could block. Nginx would be able to handle that.
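
    An allow-list version might look something like this (the approved patterns are placeholders; anything whose user agent doesn’t match them gets refused):

        if ($http_user_agent !~* "Mozilla|Googlebot|bingbot|DuckDuckBot") {
            return 403;
        }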

    • ctag@lemmy.sdf.org (OP) · 2 points · 10 days ago

      Thank you for the reply, but at least one commenter claims they’ll impersonate Chrome UAs.

          • ctag@lemmy.sdf.org (OP) · 7 points · 9 days ago

            In the Hacker News comments for that geraspora link, people discussed websites shutting down due to hosting costs, which may be attributable in part to the overly aggressive crawling. So maybe it’s just a different form of DDoS than we’re used to.

  • WasPentalive@lemmy.one · 3 points · edited · 9 days ago

    When one of these bots attacks your site, does it send the info back to the spoofed address, or does the scraped info go to its real IP address? Is there some way to get a fix on the actual bot and not on some home user whose network-facing IP address got hijacked?