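I've run into a problem that I don't know how to solve.
I have scraped several pages for the company name, location and province, plus a link to another page with further information. Each link I collected leads to 3 more pieces of information that I need.
I need to visit that link, extract the address, the phone number (if there is one) and the CNAE code, and append them to the data collected earlier.
The working script I currently have for the first scrape is as follows: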
import requests
from bs4 import BeautifulSoup

baseurl = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
urls = [f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 65)]
allurls = baseurl + urls
print(allurls)

for url in allurls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")

    # scrape the pages
    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        link = lis.select("li.col1 a")[0]['href']
        info = [title, location, province, link]
        print(info)
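On the second page, the data is in a table, with the id names shown below. This is the code I think I need to use, but it doesn't work, and I'm going round in circles trying to figure out why: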
section = soup.select("section#datos_empresa")
lslinks = link
for ls in lslinks
    location = lis.find('tr', id_="tamano_empresa").text
    cnae = lis.find('tr', id_="cnae_codigo_empresa").text
    phone = lis.find('tr', id_="telefono_empresa").text
    addinfo = [location, cnae, phone]
    info.append(addinfo)
Here is an example of one of the links
Ideally, the output would be:
['AGRICOLA CALLEJA SL', 'CARPIO', 'VALLADOLID', 'https://www.expansion.com/directorio-empresas/agricola-calleja-sl_1480101_A02_47.html', C/ LA TORRE, 2., 150, 983863247]
I will write this to a text file so that I can import it into Excel.
Any help would be greatly appreciated!
Cheers!
A uj5u.com user replied:
Here is a minimal working solution so far.
Code:
import requests
from bs4 import BeautifulSoup
# import pandas as pd  # only needed for the commented-out DataFrame export below

baseurl = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
urls = [f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 5)]  # range(2, 65) for the full run
allurls = baseurl + urls
# print(allurls)
data = []

for url in allurls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")

    # scrape the pages
    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        link = lis.select_one("li.col1 a")['href']
        # info = [title, location, province, link]
        # print(info)

        # follow the link to the company's detail page
        sub_page = requests.get(link)
        soup2 = BeautifulSoup(sub_page.content, "html.parser")
        direction = soup2.select_one('#direccion_empresa').text
        cnae = soup2.select_one('#cnae_codigo_empresa').text
        phone = soup2.select_one('#telefono_empresa')
        telephone = phone.text if phone else None  # the phone number is optional
        print([title, location, province, link, direction, cnae, telephone])
        # data.append([title, location, province, link, direction, cnae, telephone])

# cols = ["title", "location", "province", "link", "direction", "cnae", "telephone"]
# df = pd.DataFrame(data, columns=cols)
# print(df)
# df.to_csv('info.csv', index=False)
Output:
['A CORTIÑA DOS ACIVROS SL', 'LUGO', 'LUGO', 'https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html', 'CRTA. A CORUÑA, 16.', '150', '']
['A CORTIÑA DOS ACIVROS SL', 'LUGO', 'LUGO', 'https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html', 'CRTA. A CORUÑA, 16.', '150', '']
['A P V 19 32 SL', 'VALENCIA', 'VALENCIA', 'https://www.expansion.com/directorio-empresas/a-p-v-19-32-sl_672893_A02_46.html', 'CALLE SALVA, 8 1 2B.', '150', '']
['ABADIA DE JABUGO SL', 'CARTAYA', 'HUELVA', 'https://www.expansion.com/directorio-empresas/abadia-de-jabugo-sl_5442689_A02_21.html', 'URB. MARINA EL ROMPIDO, 31 VILLA M-31. CRTA. EL RO.', '150', '']
['ABALOS REAL SLL', 'CARBONERAS DE GUADAZAON', 'CUENCA', 'https://www.expansion.com/directorio-empresas/abalos-real-sll_1239004_A02_16.html', 'C/ DON CRUZ, 23.', '150', '969142092']
... and so on
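The guard on the phone field in the code above matters because select_one returns None when nothing matches, and calling .text on None raises an AttributeError. Below is a minimal sketch of the same idea as a reusable helper. It runs against an inline HTML fragment rather than the live site; the helper name text_or_none and the sample markup are hypothetical, and only the element ids are taken from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a company detail page. Only the ids
# (direccion_empresa, cnae_codigo_empresa, telefono_empresa) come from the answer.
SAMPLE_HTML = """
<table>
  <tr><td id="direccion_empresa"> C/ DON CRUZ, 23. </td></tr>
  <tr><td id="cnae_codigo_empresa">150</td></tr>
</table>
"""


def text_or_none(soup, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = soup.select_one(css_selector)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
address = text_or_none(soup, "#direccion_empresa")
cnae = text_or_none(soup, "#cnae_codigo_empresa")
phone = text_or_none(soup, "#telefono_empresa")  # deliberately absent from the fragment

print([address, cnae, phone])
```

With a helper like this, every field is treated as optional, so a page with a missing cell yields None instead of stopping the whole run.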
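A uj5u.com user replied:
In your subpage code, you were trying to select the section by ID rather than by class, so it did not match anything. You can also use td.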
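To make the difference concrete, here is a small sketch run against an inline fragment. The markup is a hypothetical stand-in that assumes, as this answer does, that the section carries a class and the table cells carry ids. It also shows that id_ is not a Beautiful Soup shorthand the way class_ is: it is treated as a filter on an attribute literally named id_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the section has a class (not an id), the cell has an id.
SAMPLE_HTML = """
<section class="datos_empresa">
  <table><tr><td id="cnae_codigo_empresa">150</td></tr></table>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_id = soup.select("section#datos_empresa")     # looks for id="datos_empresa"
by_class = soup.select("section.datos_empresa")  # looks for class="datos_empresa"

section = by_class[0]
# id_ is passed through as an attribute filter named "id_", so nothing matches.
wrong_kwarg = section.find("td", id_="cnae_codigo_empresa")
# id is the real attribute name, so this finds the cell.
right_kwarg = section.find("td", id="cnae_codigo_empresa")

print(len(by_id), len(by_class))
print(wrong_kwarg, right_kwarg.text)
```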
Your subpage logic needs to be combined with your main page logic. Try the following:
import requests
from bs4 import BeautifulSoup
import csv

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(["Title", "Location", "Province", "Link", "Location", "cnae", "Phone"])

    urls = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
    urls.extend(f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 65))

    for url in urls:
        print(url)
        r_main = requests.get(url)
        soup_main = BeautifulSoup(r_main.content, "html.parser")

        for lis in soup_main.select("div#simulacion_tabla ul"):
            title = lis.find('li', class_="col1").text
            location = lis.find('li', class_="col2").text
            province = lis.find('li', class_="col3").text
            link = lis.select("li.col1 a")[0]['href']
            print(' ', link)

            # fetch the company's detail page and read the extra fields
            r_sub = requests.get(link)
            soup_sub = BeautifulSoup(r_sub.content, "html.parser")
            section = soup_sub.select_one("section.datos_empresa")
            # Note: this reuses the name location, so the tamano_empresa value
            # replaces the town and is written to both Location columns.
            location = section.find('td', id="tamano_empresa").text
            cnae = section.find('td', id="cnae_codigo_empresa").text
            phone = section.find('td', id="telefono_empresa").text
            csv_output.writerow([title, location, province, link, location, cnae, phone])
This creates a CSV output file that starts:
Title,Location,Province,Link,Location,cnae,Phone
A CORTIÑA DOS ACIVROS SL,DESCONOCIDO,LUGO,https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html,DESCONOCIDO,150,
A CORTIÑA DOS ACIVROS SL,DESCONOCIDO,LUGO,https://www.expansion.com/directorio-empresas/a-cortina-dos-acivros-sl_9163006_A02_27.html,DESCONOCIDO,150,
A P V 19 32 SL,MICROEMPRESA,VALENCIA,https://www.expansion.com/directorio-empresas/a-p-v-19-32-sl_672893_A02_46.html,MICROEMPRESA,150,
ABADIA DE JABUGO SL,DESCONOCIDO,HUELVA,https://www.expansion.com/directorio-empresas/abadia-de-jabugo-sl_5442689_A02_21.html,DESCONOCIDO,150,
ABALOS REAL SLL,MICROEMPRESA,CUENCA,https://www.expansion.com/directorio-empresas/abalos-real-sll_1239004_A02_16.html,MICROEMPRESA,150,969142092
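The CSV above shows the same value in both Location columns because the code assigns to the name location twice, so the company size replaces the town before the row is written. Here is one possible way to avoid that, sketched with only the standard library: give every field its own name and let csv.DictWriter map names to columns. The field names and the make_row helper are hypothetical, the two sample rows are pieced together from the outputs shown in this thread, and io.StringIO stands in for the real output file.

```python
import csv
import io

# Hypothetical column names; each scraped value gets its own distinct key.
FIELDS = ["title", "town", "province", "link", "address", "size", "cnae", "phone"]


def make_row(listing, details):
    """Merge listing-page and detail-page fields into one CSV row."""
    row = {**listing, **details}
    # Missing optional values (for example the phone) become empty cells.
    return {field: row.get(field) or "" for field in FIELDS}


# Sample values assembled from the outputs quoted in this thread.
samples = [
    (
        {"title": "A P V 19 32 SL", "town": "VALENCIA", "province": "VALENCIA",
         "link": "https://www.expansion.com/directorio-empresas/a-p-v-19-32-sl_672893_A02_46.html"},
        {"address": "CALLE SALVA, 8 1 2B.", "size": "MICROEMPRESA", "cnae": "150", "phone": None},
    ),
    (
        {"title": "ABALOS REAL SLL", "town": "CARBONERAS DE GUADAZAON", "province": "CUENCA",
         "link": "https://www.expansion.com/directorio-empresas/abalos-real-sll_1239004_A02_16.html"},
        {"address": "C/ DON CRUZ, 23.", "size": "MICROEMPRESA", "cnae": "150", "phone": "969142092"},
    ),
]

buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='', encoding='utf-8')
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
for listing, details in samples:
    writer.writerow(make_row(listing, details))

csv_text = buffer.getvalue()
print(csv_text)
```

If Excel displays accented characters incorrectly when opening the real file, writing it with encoding='utf-8-sig' instead of 'utf-8' is a common workaround.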
