I currently have a function that takes a URL string, reads the page to find the information I need, and saves it as a JSON file:
import re
import requests
from bs4 import BeautifulSoup

def log_scrape(url):
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'}
    response = requests.get(url=url, headers=HEADERS)
    soup = BeautifulSoup(response.content, 'html.parser')
    data = soup.find_all('script')[8]
    dataString = data.text.rstrip()
    logData = re.findall(r'{.*}', dataString)
    try:
        urlLines = url.split('/')
        if len(urlLines) < 5:
            bossName = urlLines[3]
        elif len(urlLines) == 5:
            bossName = urlLines[4]
    except Exception as e:
        return 'Error: ' + str(e)
    tag = bossName.split('_')
    bossTag = tag[1]
    try:
        # Wing_1
        if bossTag == 'vg':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_1\Valley_Guardian'
        elif bossTag == 'gors':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_1\Gorseval_The_Multifarious'
        elif bossTag == 'sab':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_1\Sabetha'
        # Wing_2
        elif bossTag == 'sloth':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_2\Slothasor'
        elif bossTag == 'matt':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_2\Mathias'
        # Wing_3
        elif bossTag == 'kc':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_3\Keep_Construct'
        elif bossTag == 'xera':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_3\Xera'
        # Wing_4
        elif bossTag == 'cairn':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_4\Cairn_The_Indomitable'
        elif bossTag == 'mo':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_4\Mursaat_Overseer'
        elif bossTag == 'sam':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_4\Samarog'
        elif bossTag == 'dei':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_4\Deimos'
        # Wing_5
        elif bossTag == 'sh':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_5\Soulless_Horror_Deesmina'
        elif bossTag == 'dhuum':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_5\Dhuum'
        # Wing_6
        elif bossTag == 'ca':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_6\Conjured_Amalgamate'
        elif bossTag == 'twinlargos':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_6\Twin_Largos'
        elif bossTag == 'qadim':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_6\Qadim'
        # Wing_7
        elif bossTag == 'adina':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_7\Cardinal_Adina'
        elif bossTag == 'sabir':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_7\Cardinal_Sabir'
        elif bossTag == 'prlqadim' or bossTag == 'qpeer':
            pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_7\Qadim_The_Peerless'
    except:
        pathName = r'ETL\EXTRACT_00\Web Scraping\Boss_data'
    with open(f'{pathName}\\{bossName}.json', 'w') as f:
        for line in logData:
            jsonFile = f.write(line)
    return jsonFile
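As a side note, I know the long if/elif chain is really just a tag-to-folder mapping, so a dictionary lookup could replace it. A rough sketch, with only a few of the entries shown:

# Hypothetical lookup table replacing the if/elif chain above;
# only a few of the tag -> folder entries are shown.
BOSS_PATHS = {
    'vg':   r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_1\Valley_Guardian',
    'gors': r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_1\Gorseval_The_Multifarious',
    'matt': r'ETL\EXTRACT_00\Web Scraping\Boss_data\Wing_2\Mathias',
}

# dict.get falls back to the base folder for any tag not in the table,
# which also covers tags that none of the elif branches matched.
pathName = BOSS_PATHS.get(bossTag, r'ETL\EXTRACT_00\Web Scraping\Boss_data')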
However, running it one URL at a time makes the process very slow, so I want to try using a txt file, looping over it, and running the function on each line. The txt file looks like this:
https://gw2wingman.nevermindcreations.de/logContent/20220829-151336_matt_kill
https://gw2wingman.nevermindcreations.de/logContent/20220831-214520_sabir_kill
https://gw2wingman.nevermindcreations.de/logContent/20220831-190128_sabir_kill
I tried using a for loop:
with open('gw2_urls.txt', 'r') as urls:
    for url in urls:
        print(log_scrape(url))
But it always fails with "IndexError: list index out of range" at the line data = soup.find_all('script')[8], even though the same URLs work fine when I run them one by one.
If you know why this happens, and how to speed this process up, that would be very helpful.
uj5u.com user reply:
Each line you read from the text file keeps its trailing newline character, so the URL you pass to log_scrape is not the same string you tested by hand — the request then returns a different page with fewer script tags, which is most likely why soup.find_all('script')[8] raises the index error. Read the lines and strip the whitespace off each one:

with open('gw2_urls.txt', 'r') as f:
    urls = f.readlines()

for url in urls:
    print(log_scrape(url.strip()))

For more details on readlines(), see https://www.w3schools.com/python/ref_file_readlines.asp
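As for the speed part of your question: most of the time is spent waiting on HTTP responses, so the URLs can be fetched concurrently. A minimal sketch using the standard library's concurrent.futures, assuming log_scrape is your function above:

from concurrent.futures import ThreadPoolExecutor

with open('gw2_urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

# Each worker thread runs log_scrape on one URL, so the network wait
# overlaps across threads instead of happening one URL at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(log_scrape, urls):
        print(result)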
uj5u.com user reply:
If I understand correctly, you want the links from the data? You only ever take a single element with soup.find_all('script')[8], and only if it exists at all — find_all('script') returns a list of every script element on the page. Here is an example using <a> tags and the href attribute:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
This can be rewritten as a list comprehension:
log_data = [a.get('href') for a in soup.find_all('a')]
and then written out to a file like this:

with open('gw2_urls.txt', 'w') as f:
    for link in log_data:
        f.write(link + "\n")
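And if you keep the hard-coded index inside log_scrape, a quick length check turns the IndexError into a readable message — a rough sketch of how that line could be guarded:

scripts = soup.find_all('script')
if len(scripts) <= 8:
    # Fewer script tags than expected: the request probably returned
    # an error page (e.g. for a URL with a stray newline) rather than
    # the log page, so report it instead of raising IndexError.
    return f'Error: unexpected page layout for {url!r}'
data = scripts[8]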