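I'm trying to do some work with real estate data, and after failing on my own I managed to borrow some code that extracts part of the data. Unfortunately, I don't know how to parse the rest, because the JSON format confuses me. This isn't my area of expertise, so I'd appreciate any ideas on how to solve this. I can post the whole JSON if needed, but it's very long.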
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import pprint
#-------------------------------------------------------------------------------------------------------------------------#
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'upgrade-insecure-requests': '1'
}
#-------------------------------------------------------------------------------------------------------------------------#
def get_soup(address):
    page_request = requests.get(address, headers=HEADERS)
    return BeautifulSoup(page_request.text, "lxml")
#-------------------------------------------------------------------------------------------------------------------------#
def fetch_content(soup, verbose=False):
    item = soup.select_one("script#hdpApolloPreloadedData").text
    d = json.loads(item)['apiCache']
    return json.loads(d)
#-------------------------------------------------------------------------------------------------------------------------#
def process_fetched_content(raw_dictionary=None):
    if raw_dictionary is not None:
        keys = [k for k in raw_dictionary.keys() if k.startswith('VariantQuery{"zpid":')]
        property_info = dict((k.split(':')[-1].replace('}',''), raw_dictionary.get(k).get('property', None)) for k in keys)
        return property_info
    else:
        return None
#-------------------------------------------------------------------------------------------------------------------------#
if __name__ == "__main__":
    link = 'https://www.zillow.com/homedetails/2408-Comstock-Ct-Naperville-IL-60564/5367006_zpid/'
    soup = get_soup(link)
    results = process_fetched_content(raw_dictionary = fetch_content(soup, verbose=False))
    pprint.pprint(results)
Side note: I know Zillow doesn't like scraping, but I'm not looking to pull data at scale, so I'm not too worried about it.
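For anyone puzzled by the two json.loads calls in fetch_content: judging from the code, the script tag holds a JSON object whose apiCache value is itself a JSON-encoded string, so it has to be decoded a second time. Below is a minimal sketch of that idea using a made-up payload. The field names inside "property" and the second query key are invented for illustration and are not Zillow's actual schema.

```python
import json

# Hypothetical stand-in for the text of the <script id="hdpApolloPreloadedData"> tag.
# The inner keys and values are invented for illustration only.
inner = {
    'VariantQuery{"zpid":5367006}': {"property": {"bedrooms": 4, "city": "Naperville"}},
    'SomeOtherQuery{"zpid":5367006}': {"viewer": None},
}
script_text = json.dumps({"apiCache": json.dumps(inner)})

outer = json.loads(script_text)             # first decode: the outer object
api_cache = json.loads(outer["apiCache"])   # second decode: apiCache was still a string

# Same key handling as process_fetched_content: keep only the VariantQuery
# entries and use the text after the last ':' (minus the closing brace) as the zpid.
property_info = {
    key.split(":")[-1].replace("}", ""): value.get("property")
    for key, value in api_cache.items()
    if key.startswith('VariantQuery{"zpid":')
}
print(property_info)
```

After the second decode, api_cache is an ordinary nested dictionary, so "parsing the rest" comes down to walking dicts and lists.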
Answer:
I believe you can do it like this (unless I've badly misunderstood the question):
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import pprint
# -------------------------------------------------------------------------------------------------------------------------#
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "upgrade-insecure-requests": "1",
}
# -------------------------------------------------------------------------------------------------------------------------#
def get_soup(address):
    page_request = requests.get(address, headers=HEADERS)
    return BeautifulSoup(page_request.text, "lxml")
# -------------------------------------------------------------------------------------------------------------------------#
def fetch_content(soup, verbose=False):
    item = soup.select_one("script#hdpApolloPreloadedData").text
    d = json.loads(item)["apiCache"]
    return json.loads(d)
# -------------------------------------------------------------------------------------------------------------------------#
def process_fetched_content(raw_dictionary=None):
    if raw_dictionary is not None:
        keys = [
            k for k in raw_dictionary.keys() if k.startswith('VariantQuery{"zpid":')
        ]
        property_info = dict(
            (
                k.split(":")[-1].replace("}", ""),
                raw_dictionary.get(k).get("property", None),
            )
            for k in keys
        )
        return raw_dictionary, property_info
    else:
        return None
# -------------------------------------------------------------------------------------------------------------------------#
if __name__ == "__main__":
    link = "https://www.zillow.com/homedetails/2408-Comstock-Ct-Naperville-IL-60564/5367006_zpid/"
    soup = get_soup(link)
    raw, results = process_fetched_content(raw_dictionary=fetch_content(soup, verbose=False))
    # Traverse through results
    for value in results.values():
        for inner_key, inner_value in value.items():
            print(f'{inner_key}: {inner_value}')
    # Traverse through raw dictionary
    for key, value in raw.items():
        print(f'{key}:')
        for inner_key, inner_value in value.items():
            print(f'\t{inner_key}:')
            try:
                for inner_2_key, inner_2_value in inner_value.items():
                    print(f'\t\t{inner_2_key}:')
                    try:
                        for inner_3_key, inner_3_value in inner_2_value.items():
                            print(f'\t\t\t{inner_3_key}:')
                            try:
                                # Case 1: a list of dicts
                                for inner_4_value in inner_3_value:
                                    for inner_4_1_key, inner_4_1_value in inner_4_value.items():
                                        print(f'\t\t\t\t{inner_4_1_key}: {inner_4_1_value}')
                            except (AttributeError, TypeError):
                                # Case 2: a plain dict
                                for inner_4_key, inner_4_value in inner_3_value.items():
                                    print(f'\t\t\t\t{inner_4_key}: {inner_4_value}')
                    except (AttributeError, TypeError):
                        # Not a dict at this level: print the value as-is
                        print(f'\t\t\t{inner_2_value}')
            except (AttributeError, TypeError):
                # Not a dict at this level: print the value as-is
                print(f'\t\t{inner_value}')
Try this. The output is far too long to post, but it seems readable now...
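If the fixed number of nested loops gets unwieldy, the same traversal can be written recursively so it handles any depth. This is only a sketch of that alternative, not part of the code above: walk is a hypothetical helper name, and the sample dictionary is invented so the snippet runs without a network request.

```python
def walk(node, path=()):
    """Yield (path, value) pairs for every leaf in nested dicts and lists.

    Empty dicts and empty lists yield nothing.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (str(key),))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (str(index),))
    else:
        yield path, node


# Invented sample data; with the script above you would pass `raw` or `results`.
sample = {
    "property": {
        "address": {"city": "Naperville", "zipcode": "60564"},
        "priceHistory": [
            {"event": "Sold", "price": 1},
            {"event": "Listed", "price": 2},
        ],
    }
}

# Join each path into a dotted key, e.g. "property.priceHistory.0.event".
flat = {".".join(path): value for path, value in walk(sample)}
for dotted_key, value in flat.items():
    print(f"{dotted_key}: {value}")
```

Since pandas is already imported, a flat dictionary like this could then be loaded with pd.Series(flat) or pd.DataFrame([flat]) if a table is more convenient than printed output.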
