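Yesterday @老灵 told me over QQ that Feedly could no longer fetch 老头博客. I was puzzled at first, but then it occurred to me: while tinkering a while back I had blocked a number of "junk" spiders, and Feedly was probably caught as collateral damage.
I dug up the code I had added earlier and, sure enough, both Feedly and FeedDemon had been knocked out. Oops. Below is the code I am using now; drop it into the functions.php file of your WordPress theme. (Tested on PHP 7.3; lower versions are untested.)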
if ( ! is_admin() ) {
    add_action( 'init', 'deny_mirrored_request', 0 );
}
function deny_mirrored_request()
{
    // Read the User-Agent header (empty string if the header is missing)
    $ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
    // User-Agent substrings to block
    $now_ua = array( 'BOT/0.1 (BOT for JCE)', 'CrawlDaddy', 'Java', 'UniversalFeedParser', 'ApacheBench', 'Swiftbot', 'ZmEu', 'Indy Library', 'oBot', 'jaunty', 'YandexBot', 'AhrefsBot', 'MJ12bot', 'WinHttp', 'EasouSpider', 'HttpClient', 'Microsoft URL Control', 'Python-urllib', 'lightDeckReports Bot' );
    // Block empty User-Agents: mainstream scrapers such as dedecms send an empty User-Agent, as do some SQL injection tools
    if ( empty( $ua ) || preg_match( '/PHP/i', $ua ) ) {
        // The block message is Chinese, hence the explicit UTF-8 header
        header( 'Content-type: text/html; charset=utf-8' );
        wp_die( '请勿采集本站,因为采集的站长木有小JJ!' );
    } else {
        foreach ( $now_ua as $value ) {
            // Case-insensitive substring match; preg_quote() keeps entries such as
            // 'BOT/0.1 (BOT for JCE)' from being interpreted as regex syntax
            if ( preg_match( '~' . preg_quote( $value, '~' ) . '~i', $ua ) ) {
                header( 'Content-type: text/html; charset=utf-8' );
                wp_die( '请勿采集本站,因为采集的站长木有小JJ!' );
            }
        }
    }
}
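Since the whole problem was a feed reader ending up on the block list, one possible safeguard is to check an allowlist before the block list. The following is only a sketch under assumptions: the function name cyh_is_blocked_ua, the allowlist entries, and the shortened block list are all hypothetical and not part of the snippet above. It uses stripos() for plain case-insensitive substring matching, so no regex escaping is needed.

```php
<?php
/**
 * Decide whether a User-Agent string should be blocked.
 * Hypothetical helper: name, default allowlist, and call sites are assumptions.
 */
function cyh_is_blocked_ua( $ua, array $blocked, array $allowed = array( 'Feedly', 'FeedDemon' ) ) {
    // Same first rule as the original snippet: empty UA or one mentioning PHP.
    if ( empty( $ua ) || false !== stripos( $ua, 'PHP' ) ) {
        return true;
    }
    // Allowlisted feed readers win over any block-list entry.
    foreach ( $allowed as $needle ) {
        if ( false !== stripos( $ua, $needle ) ) {
            return false;
        }
    }
    // Otherwise block on any case-insensitive substring match.
    foreach ( $blocked as $needle ) {
        if ( false !== stripos( $ua, $needle ) ) {
            return true;
        }
    }
    return false;
}

// Wire it up only when running inside WordPress (assumed usage, shortened list).
if ( function_exists( 'add_action' ) && ! is_admin() ) {
    add_action( 'init', function () {
        $ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
        if ( cyh_is_blocked_ua( $ua, array( 'MJ12bot', 'AhrefsBot', 'UniversalFeedParser' ) ) ) {
            wp_die( 'Please do not scrape this site.' );
        }
    }, 0 );
}
```

Keeping the decision in a function with no WordPress calls means it can be exercised with sample User-Agent strings before it goes anywhere near a live site.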
Use curl to simulate requests. For example, curl -I -A '' https://cyhour.com simulates a visit with an empty UA:
[root@host ~]# curl -I -A '' https://cyhour.com
HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Tue, 30 Jul 2019 01:50:57 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
[root@host ~]# curl -I -A 'php' https://cyhour.com
HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Tue, 30 Jul 2019 01:51:07 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
[root@host ~]# curl -I -A 'Googlebot' https://cyhour.com
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 30 Jul 2019 01:55:26 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Vary: Accept-Encoding
Link: <https://cyhour.com/wp-json/>; rel="https://api.w.org/"
Strict-Transport-Security: max-age=15768000
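The 500 status on the blocked requests above is consistent with wp_die(), whose default response code is 500. If a 403 seems a better fit for a deliberate refusal, wp_die() accepts a response code through its third argument. A minimal sketch, assuming hypothetical helper names (cyh_deny_args, cyh_deny_request) that are not part of the original snippet:

```php
<?php
/**
 * Arguments for wp_die() that request a 403 instead of the default 500.
 * Hypothetical helper, for illustration only.
 */
function cyh_deny_args() {
    return array( 'response' => 403 );
}

/**
 * Stop the request with a 403 and the given message.
 * Hypothetical helper; falls back to plain PHP when WordPress is not loaded.
 */
function cyh_deny_request( $message ) {
    if ( function_exists( 'wp_die' ) ) {
        // Inside WordPress: wp_die() renders the error page and sends the status.
        wp_die( $message, 'Forbidden', cyh_deny_args() );
    }
    // Outside WordPress: send the status and message by hand.
    http_response_code( 403 );
    header( 'Content-Type: text/html; charset=utf-8' );
    echo $message;
    exit;
}
```

With this in place, the two wp_die() calls in deny_mirrored_request() could be swapped for cyh_deny_request(), and the curl checks above would be expected to show 403 rather than 500 for blocked User-Agents.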
References: 张戈博客 (Zhang Ge's blog) - https://zhang.ge/5101.html, https://zhang.ge/4458.html
This post was first published as: Accidentally Blocking Feedly - 垃圾站
Lots of sites block these; they say it is to prevent scraping.
@张波博客 Well, it can prevent scraping, but it also blocks some legitimate users.