I'm building a project that runs Google searches through the googlesearch module and sorts the results by top-level domain. I'll use COVID-19 as the example.
Input:
import googlesearch

for search in googlesearch.search("COVID-19", lang='en'):
    print(search)
Output:
https://www.cdc.gov/coronavirus/2019-ncov/index.html
https://coronavirus.jhu.edu/map.html
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.who.int/health-topics/coronavirus
https://www.worldometers.info/coronavirus/
https://en.wikipedia.org/wiki/COVID-19
https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home
https://www.michigan.gov/coronavirus/
https://coronavirus.in.gov/
https://www.osha.gov/coronavirus
https://covid19.nj.gov/
So that part works, and I can change the input to:
search_results = []
for search in googlesearch.search("COVID-19", lang='en'):
    search_results.append(search)
That gives me a list of sites. Now I want to sort them using the following order:
[".gov/", ".int/", ".com/", ".edu/", ".org/", ".info/"]
The order may change later, and/or more domains may be added. So the sorted version I want is:
https://www.cdc.gov/coronavirus/2019-ncov/index.html
https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home
https://www.michigan.gov/coronavirus/
https://coronavirus.in.gov/
https://www.osha.gov/coronavirus
https://covid19.nj.gov/
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.who.int/health-topics/coronavirus
https://coronavirus.jhu.edu/map.html
https://en.wikipedia.org/wiki/COVID-19
https://www.worldometers.info/coronavirus/
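Any ideas on how I can do this?
Reply from a uj5u.com user:
One approach is to create a dictionary that maps each domain extension to the rank used for sorting the URLs. Then call sorted with a lambda expression that extracts the domain extension from each URL and looks up its rank.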
import re

# Rank for each domain extension; lower values sort first.
domains = {"gov" : 1, "int" : 2, "com" : 3, "edu" : 4, "org" : 5, "info" : 6}
urls = ['https://www.cdc.gov/coronavirus/2019-ncov/index.html',
'https://coronavirus.jhu.edu/map.html',
'https://www.who.int/emergencies/diseases/novel-coronavirus-2019',
'https://www.who.int/health-topics/coronavirus',
'https://www.worldometers.info/coronavirus/',
'https://en.wikipedia.org/wiki/COVID-19',
'https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home',
'https://www.michigan.gov/coronavirus/',
'https://coronavirus.in.gov/',
'https://www.osha.gov/coronavirus',
'https://covid19.nj.gov/']
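# The regex captures the last dot-separated label of the host (the extension), which is then looked up in domains.
urls = sorted(urls, key=lambda x: domains[re.sub(r'^https?://[^/]+\.([^/]+)/.*$', r'\1', x)])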
print(urls)
This prints:
['https://www.cdc.gov/coronavirus/2019-ncov/index.html', # .gov
'https://coronavirus.ohio.gov/wps/portal/gov/covid-19/home', # .gov
'https://www.michigan.gov/coronavirus/', # .gov
'https://coronavirus.in.gov/', # .gov
'https://www.osha.gov/coronavirus', # .gov
'https://covid19.nj.gov/', # .gov
'https://www.who.int/emergencies/diseases/novel-coronavirus-2019', # .int
'https://www.who.int/health-topics/coronavirus', # .int
'https://coronavirus.jhu.edu/map.html', # .edu
'https://en.wikipedia.org/wiki/COVID-19', # .org
'https://www.worldometers.info/coronavirus/'] # .info
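Since the question mentions that the order may change and more domains may be added, one possible variation is to derive the ranks from an ordered list and to read the host with urllib.parse instead of a regex. This is only a minimal sketch and not part of the reply above: the names TLD_ORDER, TLD_RANK, tld_rank, and sort_by_tld are made up for illustration, and placing unlisted extensions at the end is an assumption rather than something the question specifies.

```python
from urllib.parse import urlsplit

# Priority order from the question, written without dots and slashes.
TLD_ORDER = ["gov", "int", "com", "edu", "org", "info"]

# Map each extension to its position in the list, e.g. {"gov": 0, "int": 1, ...}.
TLD_RANK = {tld: rank for rank, tld in enumerate(TLD_ORDER)}


def tld_rank(url):
    """Return the sort rank of a URL's last host label (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    # Assumption: extensions missing from TLD_ORDER sort after all listed ones.
    return TLD_RANK.get(tld, len(TLD_ORDER))


def sort_by_tld(urls):
    """Sort URLs by rank; sorted() is stable, so ties keep their input order."""
    return sorted(urls, key=tld_rank)


urls = [
    "https://www.cdc.gov/coronavirus/2019-ncov/index.html",
    "https://coronavirus.jhu.edu/map.html",
    "https://www.who.int/health-topics/coronavirus",
    "https://en.wikipedia.org/wiki/COVID-19",
    "https://example.co.uk/page",  # made-up URL whose extension is not listed
]
for url in sort_by_tld(urls):
    print(url)
```

Because sorted() is stable, URLs that share an extension stay in the order the search returned them, which matches the desired output in the question. Changing the priority then only means editing TLD_ORDER, and a URL with no path or an unlisted extension no longer raises a KeyError.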
