Here is my code:
from bs4 import BeautifulSoup
import requests, lxml
import re
from urllib.parse import urljoin
from googlesearch import search
import pandas as pd

query = 'A M C Engineering College, Bangalore'
link = []
for i in search(query, tld='co.in', start=0, stop=1):
    print(i)
    soup = BeautifulSoup(requests.get(i).text, 'lxml')
    for link in soup.select("a[href$='.pdf']"):
        if re.search(r'nirf', str(link), flags=re.IGNORECASE):
            fUrl = urljoin(i, link['href'])
            print(fUrl)
            link.append(fUrl)
print(link)
df = pd.DataFrame(link, columns=['PDF LINKS'])
print(df)
Here is the output I get after running the code:
https://www.amcgroup.edu.in/AMCEC/index.php
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf
https://www.amcgroup.edu.in/AMCEC/image/download/NIRFMBA.pdf
https://www.amcgroup.edu.in/AMCEC/image/download/NIRF_2019.pdf
https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf
# Printing the list of links, but getting a tag instead
<a href="image/gallery/Swami Vivekananda.pdf" target="_black">For Invitation Click here...</a>
# DataFrame where I want to store the list
                      PDF LINKS
0  For Invitation Click here...
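I should get the list of links shown in the output, but when I print the list it gives me a whole tag instead of the links. I also want to push all the links I get for one query into a single DataFrame row, like this:
    PDF LINKS
0   link1 link2 link3   #for query1
1   link1 link2         #for another query
How can I achieve this? What is wrong with my code, and why am I getting a tag instead of the list? Thank you.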
Answer from a uj5u.com user:
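Use different variable names for the list and for the tag in the for-loop. In your code, `for link in soup.select(...)` rebinds `link` from the list to each tag in turn, so `link.append(fUrl)` appends to a tag, and `print(link)` shows the last tag the loop visited. With separate names: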
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

query = "A M C Engineering College, Bangalore"
all_data = []
for i in ["https://www.amcgroup.edu.in/AMCEC/index.php"]:
    soup = BeautifulSoup(requests.get(i).text, "lxml")
    for link in soup.select("a[href$='.pdf']"):  # <-- `link` and `all_data` are different here!
        if re.search(r"nirf", link["href"], flags=re.IGNORECASE):
            fUrl = urljoin(i, link["href"])
            all_data.append(fUrl)
df = pd.DataFrame(all_data, columns=["PDF LINKS"])
print(df)
Prints:
PDF LINKS
0 https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf
1 https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf
2 https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf
3 https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf
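The name clash can be reproduced without any network access. The sketch below is plain Python with made-up strings standing in for BeautifulSoup tags; the function names `collect_buggy` and `collect_fixed` are hypothetical and exist only for illustration.

```python
def collect_buggy(items):
    link = []              # intended accumulator
    for link in items:     # rebinds `link` to each item; the list is lost
        pass
    return link            # the last item visited, not a list


def collect_fixed(items):
    all_links = []         # accumulator with its own name
    for link in items:     # `link` is only the loop variable
        all_links.append(link)
    return all_links


hrefs = ["NIRFENGG.pdf", "NIRF_2019.pdf"]  # placeholder values
print(collect_buggy(hrefs))   # a single string
print(collect_fixed(hrefs))   # the full list
```

The buggy version hands back whatever the loop touched last, which is the same pattern as the tag showing up in the question's output.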
EDIT: To put the results in one row:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

query = "A M C Engineering College, Bangalore"
all_data = []
for i in ["https://www.amcgroup.edu.in/AMCEC/index.php"]:
    soup = BeautifulSoup(requests.get(i).text, "lxml")
    row = []  # links found on this page
    for link in soup.select(
        "a[href$='.pdf']"
    ):  # <-- `link` and `all_data` are different here!
        if re.search(r"nirf", link["href"], flags=re.IGNORECASE):
            fUrl = urljoin(i, link["href"])
            row.append(fUrl)
    if row:
        all_data.append(row)
df = pd.DataFrame({"PDF LINKS": all_data})
print(df)
Prints:
PDF LINKS
0 [https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFENGG.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRFMBA.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2019.pdf, https://www.amcgroup.edu.in/AMCEC/image/Download/NIRF_2020.pdf]
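If each row should instead hold the links as one space-separated string per query, as in the layout sketched in the question, the per-query lists could be joined before building the frame. This is a rough sketch and not part of the original answer: `rows_per_query` is a hypothetical helper, and the input dict is placeholder data standing in for whatever each query's scrape returned.

```python
def rows_per_query(results):
    """Turn {query: [url, ...]} into one record per query.

    Queries with no matching links are skipped, mirroring the
    `if row:` check in the code above.
    """
    rows = []
    for query, urls in results.items():
        if urls:
            rows.append({"query": query, "PDF LINKS": " ".join(urls)})
    return rows


# Placeholder data; real values would come from the scraping loop.
results = {
    "query1": ["link1", "link2", "link3"],
    "another query": ["link1", "link2"],
    "query with no hits": [],
}
for record in rows_per_query(results):
    print(record)
```

Passing the returned list to `pd.DataFrame(...)` should give one row per query, with `query` and `PDF LINKS` columns.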
