Using htaccess rewrite rules for a shop search page

I am trying to block certain user agents, mainly bots and crawlers, from accessing the search page, because they end up driving up CPU usage.

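Naturally, I am using the htaccess rewrite engine. This is what I currently have (I have been trying many different combinations of rules that I found on SO and elsewhere):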

# Block user agents
ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]
# RewriteCond %{QUERY_STRING} ^s=(.*)$
# RewriteCond /shop(?:\.php)?\?s=([^\s&]+)
RewriteCond %{QUERY_STRING} !(^|&)s=*
RewriteCond %{REQUEST_URI} !^/robots.txt$
# RewriteRule ^shop*$ /? [R=503,L] 
RewriteRule ^shop$ ./$1 [R=503,L]

Sorry about all the commented-out lines. As I mentioned, I have been trying a lot of different things, but htaccess rewrite rules do not seem to be my forte.

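What I want to do is return a 503 error if the user agent contains "bot". The conditions are:

The user agent contains "bots" (this part works fine, I tested it).
There is an s query string, with anything in it.
It is not the robots.txt URL (at this point I think I should remove this check; it is not even needed).
Finally, if the above match, redirect /shop/?s= or /shop?s= to root and serve the 503 error document.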

A uj5u.com user replied:

Based on your shown samples and attempts, please try the following htaccess rules file.

Make sure to clear your browser cache before testing your URLs.

# Block user agents
ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
##1st condition here(User agent contains "bots")....
RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]
RewriteRule ^ - [R=503,L]

##2nd condition here(If there is a s query string, with anything in it)...
RewriteCond %{THE_REQUEST} !\s.*robot\.txt\s [NC]
RewriteRule ^ - [R=503,L]

##3rd condition here(query string contains s in it)...
RewriteCond %{THE_REQUEST} \s.*\?(.*s.*)\s [NC]
RewriteRule ^ - [R=503,L]

##4th condition here(match /shop/?s= or /shop?s= and get 503 in those requests)...
RewriteCond %{THE_REQUEST} \shop/?\?s=.*s [NC]
RewriteRule ^ - [R=503,L]

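A uj5u.com user replied:

Since you have clearly defined the criteria for the decision, you can implement them directly. I understand your question to mean that all of these criteria have to be met: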

RewriteEngine On
RewriteCond %{ENV:REDIRECT_STATUS} !=503
RewriteCond %{HTTP_USER_AGENT} bots [NC]
RewriteCond %{QUERY_STRING} (?:^|&)s=[^&]+(?:&|$)
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^/?search/?$ - [R=503,L]

I am not sure why you test for "bots" rather than "bot" (your question contradicts itself on this point).

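A uj5u.com user replied:

Finally, if the above match, redirect /shop/?s= or /shop?s= to root and serve the 503 error document.

You cannot both "redirect to root" and "serve the 503 error document". A redirect is a 3xx response. You could internally rewrite the request to root by defining ErrorDocument 503 /. However, that rather defeats the point of a 503 and does nothing to help with "CPU usage". Serving just a static string for the 503 response, as you have already defined, seems to be the best option.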

RewriteRule ^shop$ ./$1 [R=503,L]

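When you set a code outside the 3xx range, the substitution string (i.e. ./$1) is ignored entirely. You should simply use a single hyphen (-) as the substitution string to explicitly indicate "no substitution". The L flag is also superfluous here: when a non-3xx return code is specified, the L flag is implied.

It is not the robots.txt URL (at this point I think I should remove this check; it is not even needed).

Agreed, this check is entirely redundant. You are already checking that the request is for /shop, so it cannot possibly be /robots.txt as well. (Nor would any request for /robots.txt contain a query string.)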

RewriteCond %{QUERY_STRING} !(^|&)s=*

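For some reason you have negated the condition (the ! prefix), perhaps in an attempt to make it work, but you need a positive match for the s URL parameter. Note also that the regex (^|&)s=* is incorrect. The trailing * applies to the preceding = and matches it 0 or more times. Because the pattern is not anchored at the end, it matches any parameter whose name merely starts with s: s, s=, s== and s=foo, but also sort=asc. With the ! prefix in place, the condition therefore fails for s=foo, which is the very request you want to catch.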

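To match the s URL parameter with "anything in it" (including nothing), simply remove the trailing *, e.g. (^|&)s=. To match the s URL parameter with some value, additionally match a single character other than &, e.g. (^|&)s=[^&].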
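The two variants just described could be sketched as follows. This assumes the .htaccess file sits in the document root, and it leaves out the user-agent check to keep the focus on the query string; the comments and the choice to show variant B commented out are illustrative additions.

```apache
RewriteEngine On

# Variant A: "s" parameter present anywhere, value may be empty
# (matches s=, s=foo and page=2&s=foo)
RewriteCond %{QUERY_STRING} (?:^|&)s=
RewriteRule ^shop/?$ - [R=503]

# Variant B (use instead of A): "s" must have at least one character as its value
# (matches s=foo, but not s= or s=&page=2)
# RewriteCond %{QUERY_STRING} (?:^|&)s=[^&]
# RewriteRule ^shop/?$ - [R=503]
```

The (?:^|&) prefix keeps the match on a parameter named exactly s, so a parameter such as colors= is not picked up by accident.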

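The user agent contains "bots" (this part works fine, I tested it).

Is it "bots", or "bot" as mentioned in your preceding sentence? I cannot imagine "bots" matching many bots, but "bot" would match a lot!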

RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]

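This regex is needlessly complex. It also unnecessarily captures the word "bots" (which is not used later). The regex ^.*(bots).*$ is equivalent to simply bots (without the capturing group).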
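If the intent really is to match "bot" rather than "bots" (an assumption, since the question mentions both), the simplified condition might look like the sketch below. It replaces only the user-agent condition and still needs the remaining conditions and the RewriteRule after it.

```apache
# Matches any User-Agent containing "bot", case-insensitively
# (for example "Googlebot" or "bingbot"); no anchors or capturing group needed
RewriteCond %{HTTP_USER_AGENT} bot [NC]
```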

Taking the above points together, we have:

ErrorDocument 503 "Site temporarily disabled for crawling"

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bots [NC]
RewriteCond %{QUERY_STRING} (?:^|&)s=
RewriteRule ^shop/?$ - [R=503]

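The rule above does the following:

Matches the URL path shop or shop/
And the user-agent string contains the word "bots"
And the URL parameter s= is present (anywhere)
If all of the above match, serves a 503 (static string).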

However, I would query whether a 503 is really the correct response. Generally, you don't want bots to crawl internal search results at all; ever. It can be a bottomless pit and wastes crawl budget. Should this perhaps be a 403 instead?
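If a 403 turns out to be the better fit, a minimal sketch of the same rule using mod_rewrite's F (forbidden) flag could look like this. It assumes the same conditions as the rule above and a .htaccess file in the document root.

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bots [NC]
RewriteCond %{QUERY_STRING} (?:^|&)s=
# F sends a 403 Forbidden response; as with other non-3xx codes, L is implied
RewriteRule ^shop/?$ - [F]
```

The ErrorDocument 503 line would no longer apply to these requests; an ErrorDocument 403 with a static string could be defined instead if a custom message is wanted.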

And, are you blocking these URLs in robots.txt already? (And this isn't sufficient to stop the "bad" bots?) If not, I would consider adding the following:

User-agent: *
Disallow: /shop?s=
Disallow: /shop/?s=
Disallow: /shop?*&s=     # If the "s" param can occur anywhere
Disallow: /shop/?*&s=    # (As above)

Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/338196.html

Tags: Apache .htaccess rewrite
