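I'm very new to web scraping and am trying to build a project for myself. I want to scrape the list of names from the MLB top 100 prospects page: https://www.mlb.com/prospects/top100/
Currently, after loading the HTML, my code looks like this (though I have tried a variety of different techniques):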
from bs4 import BeautifulSoup
import requests

# Parse the html content
soup = BeautifulSoup(html, "lxml")

# Find all name tags:
prospects = soup.find_all("div.prospect-heashot__name")

# Iterate through all name tags
for prospect in prospects:
    # Get text from each tag
    print(prospect.text)
The final result should look something like:
Francisco Alvarez
Gunnar Henderson
Corbin Carroll
Grayson Rodriguez
Anthony Volpe
etc
Any help would be greatly appreciated!
Answer:
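This is an interesting question :) The data is stored inside the page as JSON. You can parse it with the json module and then search the nested dictionaries for the relevant data (I used recursion to do this):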
import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.mlb.com/prospects/top100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The page state is embedded as JSON in a data-init-state attribute
data = json.loads(soup.select_one("[data-init-state]")["data-init-state"])

# Keys ending in a player id, and keys ending in a ranking index
pat1 = re.compile(r"Player:\d+$")
pat2 = re.compile(r"getProspectRankings.*\)\.\d+$")


def get_data(o, pat):
    # Walk nested dicts and lists, yielding (key, value) for every
    # dict key that matches the pattern
    if isinstance(o, dict):
        for k, v in o.items():
            if pat.search(k):
                yield k, v
            else:
                yield from get_data(v, pat)
    elif isinstance(o, list):
        for v in o:
            yield from get_data(v, pat)


# Map each player key to (first name, box score name)
players = {}
for k, v in get_data(data, pat1):
    players[k] = v["useName"], v["boxscoreName"]

# Pair each rank with the player it refers to
rankings = []
for k, v in get_data(data, pat2):
    rankings.append((v["rank"], players[v["player"]["id"]]))

for rank, (name, surname) in sorted(rankings):
    print("{:>03}. {:<15} {:<15}".format(rank, name, surname))
Prints:
001. Francisco Álvarez, F
002. Gunnar Henderson
003. Corbin Carroll
004. Grayson Rodriguez, G
005. Anthony Volpe
006. Jordan Walker
007. Marcelo Mayer
008. Diego Cartaya
009. Eury Pérez
010. Jackson Chourio
011. Druw Jones
012. Jordan Lawlar
013. Jackson Holliday
014. Elly De La Cruz
015. Daniel Espino
016. Marco Luciano
017. Noelvi Marte, N
018. Brett Baty
019. Henry Davis
020. Taj Bradley
021. Kyle Harrison
022. Robert Hassell III
023. Zac Veen
024. Andrew Painter
025. Triston Casas
026. Bobby Miller
027. Ezequiel Tovar
028. Elijah Green
029. Termarr Johnson
030. Pete Crow-Armstrong
031. George Valera
032. Brooks Lee
033. Ricky Tiedemann
034. James Wood
035. Curtis Mead
036. Josh Jung
037. Kevin Parada
038. Jackson Jobe
039. Jasson Domínguez
040. Colton Cowser
041. Miguel Vargas, M
042. Michael Busch
043. Max Meyer
044. Quinn Priester
045. Jack Leiter
046. Sal Frelick
047. Tyler Soderstrom
048. Brennen Davis, B
049. Jacob Berry
050. Oswald Peraza
051. Masyn Winn
052. Edwin Arroyo
053. Gavin Williams
054. Mick Abel
055. Cade Cavalli
056. Evan Carter
057. Colson Montgomery
058. Royce Lewis
059. Owen White
060. Cam Collier
061. Adael Amador
062. Liover Peguero
063. Drew Romo
064. Logan O'Hoppe
065. Harry Ford
066. Andy Pages
067. Ken Waldichuk
068. Hunter Brown, H
069. Brayan Rocchio
070. Orelvis Martinez
071. Jace Jung
072. Gavin Cross
073. Matt McLain
074. Ryan Pepiot
075. Bo Naylor, B
076. Jordan Westburg
077. Gavin Stone
078. Justin Foscue
079. Gordon Graceffo
080. Matthew Liberatore
081. Carson Williams
082. Austin Wells
083. Jackson Merrill
084. Joey Wiemer
085. Alex Ramirez
086. Kevin Alcantara
087. DL Hall, DL
088. Alec Burleson
089. Brock Porter
090. Brandon Pfaadt
091. Tink Hence
092. Emmanuel Rodriguez, Em
093. Nick Gonzales, N
094. Zack Gelof
095. Oscar Colas
096. Ceddanne Rafaela
097. Endy Rodriguez, E
098. Dylan Lesko
099. Tanner Bibee
100. Wilmer Flores
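If you want to see how the recursive search behaves without hitting the network, here is a minimal sketch that runs the same idea on a small hand-made dictionary. Everything in `sample` is made up for illustration (the key names, the nesting, the players "Ann" and "Bob"); it only mimics the general shape of keys that the two patterns look for, and is not a copy of what the MLB page actually returns. The function name `find_matching` is likewise just a label chosen for this example.

```python
import re

# Hypothetical data: keys and values are invented to mimic the kind of
# nested structure the two regex patterns are meant to search.
sample = {
    "ROOT_QUERY": {
        'getProspectRankings({"limit":100}).0': {
            "rank": 2,
            "player": {"id": "Player:2"},
        },
        'getProspectRankings({"limit":100}).1': {
            "rank": 1,
            "player": {"id": "Player:1"},
        },
    },
    "entities": [
        {"Player:1": {"useName": "Ann", "boxscoreName": "Alpha"}},
        {"Player:2": {"useName": "Bob", "boxscoreName": "Beta"}},
    ],
}

PLAYER_KEY = re.compile(r"Player:\d+$")
RANKING_KEY = re.compile(r"getProspectRankings.*\)\.\d+$")


def find_matching(obj, pattern):
    """Yield (key, value) for every dict key matching pattern, at any depth."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if pattern.search(key):
                yield key, value
            else:
                # Not a match: keep descending into the value
                yield from find_matching(value, pattern)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_matching(item, pattern)
    # Strings, numbers and None are leaves, so nothing is yielded for them


# Map each player key to (first name, box score name)
players = {
    key: (value["useName"], value["boxscoreName"])
    for key, value in find_matching(sample, PLAYER_KEY)
}

# Pair each rank with its player, sorted by rank
rankings = sorted(
    (value["rank"], players[value["player"]["id"]])
    for _, value in find_matching(sample, RANKING_KEY)
)

for rank, (name, surname) in rankings:
    print("{:>03}. {:<15} {:<15}".format(rank, name, surname))
```

Because the search only tests dictionary keys, the string "Player:2" stored as a value under "id" is never mistaken for a player entry. If the site changes how it names these keys, the two patterns are the only part that should need adjusting.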
