Blocking the Baidu Crawler on Specific Paths of an Nginx Site

Suppose a site is served by Nginx and you want to block crawlers under a specific path.

Method 1: using the if and location directives

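This can be done with Nginx's if and location directives.
As an example, say we want crawlers to receive a 403 whenever the request path starts with /hot.
The following configuration can serve as a reference: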

server {

    server_name  www.example.com;
    index index.php index.html;

    # Start with the flag "0"
    set $sflag 0;
    # Append "1" when the request URI starts with /hot
    if ($request_uri ~* ^/hot) {
        set $sflag "${sflag}1";
    }

    # Append "2" when the User-Agent is empty or looks like a crawler
    if ($http_user_agent ~* "^$|LWP::Simple|BBBike|Baiduspider|bingbot|Scrapy|Curl|HttpClient|Qihoobot|Yahoo! Slurp|yahoo|Yahoo! Slurp China|YoudaoBot|YodaoBot|Sosospider|sohu-search|sogou|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot|robozilla|msnbot|MJ12bot|NHN|Twiceler") {
        set $sflag "${sflag}2";
    }
    # "012" means both conditions above matched
    if ($sflag = "012") {
        return 403;
    }

    location ~* ^/hot {
        # ...
    }
}

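Because nginx's if directive cannot be nested and has no logical AND, the combined condition is built indirectly:

First, set a variable $sflag to 0.
The first if checks whether the request URI starts with /hot; if so, it appends "1" to $sflag.
The second if checks whether the User-Agent request header looks like a crawler; if so, it appends "2" to $sflag.
The third if checks whether the final value is "012" (that is, both the first and the second if matched) and, if so, returns status code 403, which means access is forbidden.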
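The same "path AND User-Agent" check can also be expressed without the flag variable. The sketch below is an alternative, not the configuration described above: it uses the map directive to classify the User-Agent once, then tests the result inside the location. The variable name $is_crawler and the shortened crawler list are illustrative assumptions, and the map block is assumed to be placed in the http context.

```nginx
# http context: classify the User-Agent once per request.
# $is_crawler is a hypothetical variable name chosen for this sketch.
map $http_user_agent $is_crawler {
    default                             0;
    ""                                  1;  # empty User-Agent
    ~*(Baiduspider|bingbot|Scrapy)      1;  # shortened list, for illustration only
}

server {
    server_name  www.example.com;
    index index.php index.html;

    location ~* ^/hot {
        # An if condition is false when the value is empty or "0"
        if ($is_crawler) {
            return 403;
        }
        # ...
    }
}
```

Since the location already restricts the path, only one if is needed, and the User-Agent pattern lives in one place.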

Method 2: using the deny directive

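The second method is simple and blunt: use Nginx's deny directive.

You can compile a list of the crawler's source IP addresses and deny them one by one. (The IP list can be derived from your historical nginx access logs.)

In the example below, deny directives under the /hot path block Baidu's spider from crawling it.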

server {

    server_name  www.example.com;
    index index.php index.html;

    location ~* ^/hot {
        # Baidu crawler IP ranges
        deny 220.181.108.0/24;
        deny 116.179.32.0/24;
    }
}

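"220.181.108.0/24" comes from Baidu's Beijing data center, and
"116.179.32.0/24" comes from Baidu's Yangquan data center. Both ranges are used by Baidu's high-weight crawlers.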
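If the list of ranges grows as more addresses are found in the logs, one option is to keep the deny lines in a separate file and pull it in with include. This is only a sketch: the file path /etc/nginx/snippets/crawler_deny.conf is a hypothetical name, not something from the original setup.

```nginx
server {
    server_name  www.example.com;
    index index.php index.html;

    location ~* ^/hot {
        # Hypothetical file holding one "deny <CIDR>;" line per range, e.g.
        #   deny 220.181.108.0/24;
        #   deny 116.179.32.0/24;
        include /etc/nginx/snippets/crawler_deny.conf;
    }
}
```

Requests from a denied address are rejected with 403, and the list can be updated without touching the server block (nginx still has to be reloaded for changes to take effect).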

Please credit the source when reposting: 大后端 » Blocking the Baidu Crawler on Specific Paths of an Nginx Site
