I am scraping a series of URLs with the following code:
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)  # wait for the page to render
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    urls = elem.get_attribute("href")
    print(urls)
This prints exactly the results I want to see. The problem comes when I try to put these URLs into my empty DataFrame "df1" with the following code:
df_test = df1.append({'URLS' : urls}, ignore_index = True)
df_test.head()
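It does not show me the URLs I want (it raises no error, but the result makes no sense).
I am just starting out with Python, so I suppose my problem has a simple answer. I hope I have been clear.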
Answer from a uj5u.com user:
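The problem with your code is that you overwrite the urls variable on every iteration of the loop, so by the time you append to the DataFrame it holds only the last scraped URL. Move the df1.append statement inside the for block: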
import time
import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList[profession_name.fr.Tech][]=Data Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)  # wait for the page to render
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")
for elem in elems:
    url = elem.get_attribute("href")  # <--- get the url from the <a> tag
    df1 = df1.append({'URLS': url}, ignore_index=True)  # <--- add the url to the dataframe in the URLS column
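
As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older pandas versions. A minimal sketch of an alternative, assuming you collect the links in a plain list first and build the DataFrame once at the end. The FakeLink class and the example.com URLs are hypothetical stand-ins for the Selenium elements so the sketch runs without a browser, and collect_urls is an illustrative helper name, not part of Selenium or pandas:

```python
import pandas as pd


class FakeLink:
    """Hypothetical stand-in for a Selenium WebElement (assumption, for illustration only)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # Mimics WebElement.get_attribute for the single attribute used here.
        return self._href if name == "href" else None


def collect_urls(elems):
    """Return the href of every element, skipping elements that have none."""
    urls = []
    for elem in elems:
        href = elem.get_attribute("href")
        if href:
            urls.append(href)  # keep every URL instead of overwriting one variable
    return urls


# Placeholder data; with Selenium this would be the list returned by the XPath lookup.
elems = [
    FakeLink("https://example.com/jobs/1"),
    FakeLink("https://example.com/jobs/2"),
    FakeLink(None),
]

# Build the DataFrame once, after the loop, instead of appending row by row.
df1 = pd.DataFrame({"URLS": collect_urls(elems)})
print(df1)
```

With a real driver you would pass the list of elements returned by the XPath lookup to collect_urls. In recent Selenium 4 releases that lookup is written as driver.find_elements(By.XPATH, ...), with By imported from selenium.webdriver.common.by, because the find_elements_by_xpath helper is no longer available.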
