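My Latitude and Longitude columns contain NULLs. I need to look up each address through an API and then replace the NULL values.
How can I do that? How do I iterate over every row, extract the city name, and pass it to the API?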
+-------------+--------------------+-------+------------+--------------------+--------+---------+
|           Id|                Name|Country|        City|             Address|Latitude|Longitude|
+-------------+--------------------+-------+------------+--------------------+--------+---------+
|  42949672960|Americana Resort ...|     US|      Dillon|         135 Main St|    null|     null|
|  60129542147|Ubaa Old Crawford...|     US| Des Plaines|     5460 N River Rd|    null|     null|
| 455266533383|        Busy B Ranch|     US|   Jefferson|  1100 W Prospect Rd|    null|     null|
|1108101562370|             Motel 6|     US|    Rockport|       106 W 11th St|    null|     null|
|1382979469315|           La Quinta|     US|  Twin Falls|    539 Pole Line Rd|    null|     null|
| 292057776132|        Hyatt Dulles|     US|     Herndon|2300 Dulles Corne...|    null|     null|
| 987842478080|      Dead Broke Inn|     US|       Young|47893 N Arizona H...|    null|     null|
| 300647710720|The Miner's Inn M...|     US|    Viburnum|Highway 49 Saint ...|    null|     null|
| 489626271746|      Alyssa's Motel|     US|       Casco|              Rr 302|    null|     null|
+-------------+--------------------+-------+------------+--------------------+--------+---------+
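A uj5u.com user replied:
You can create a UDF.
First, define a function that takes an address and returns its latitude and longitude (for example, joined into a single string like "latitude,longitude"):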
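def get_latitude_longitude(address):
    # Call your API with the address as the parameter. call_geocoding_api is a
    # placeholder name for that call (not a real function); it is assumed to
    # return a (latitude, longitude) pair.
    lat, lon = call_geocoding_api(address)
    # Join latitude and longitude into a single "latitude,longitude" string.
    return f"{lat},{lon}"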
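Then register that function as a user-defined function:
from pyspark.sql.functions import udf
# With no return type given, udf() defaults to StringType.
get_latitude_longitude_UDF = udf(lambda z: get_latitude_longitude(z))
Finally, call the UDF and split its output into two columns: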
from pyspark.sql.functions import col, split
df = (df
    .withColumn('tmp', get_latitude_longitude_UDF('Address'))
    # Use col('tmp'), not df['tmp']: the original df has no 'tmp' column.
    .withColumn('Latitude', split(col('tmp'), ',').getItem(0))
    .withColumn('Longitude', split(col('tmp'), ',').getItem(1))
    .drop('tmp'))
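The API call itself is left as a comment in the function above. As a rough sketch of how it might be filled in using only the standard library: the endpoint `https://geocoder.example.com/search`, the `q` query parameter, and the `{"lat": ..., "lon": ...}` response shape are all made up for illustration, so swap in whatever your geocoding provider actually documents. `build_url` and `parse_response` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint; replace with your provider's real URL.
GEOCODE_URL = "https://geocoder.example.com/search"


def build_url(address):
    """Build the request URL, URL-encoding the address as the 'q' parameter."""
    return GEOCODE_URL + "?" + urlencode({"q": address})


def parse_response(body):
    """Turn an assumed JSON body like {"lat": 1.5, "lon": -2.5} into "lat,lon".

    Returns None if the body is not JSON or lacks usable lat/lon fields.
    """
    try:
        data = json.loads(body)
        return f"{float(data['lat'])},{float(data['lon'])}"
    except (ValueError, KeyError, TypeError):
        return None


def get_latitude_longitude(address):
    """Return "latitude,longitude" for an address, or None if the lookup fails."""
    if not address:
        return None
    try:
        with urlopen(build_url(address), timeout=10) as resp:
            return parse_response(resp.read().decode("utf-8"))
    except OSError:
        # Covers connection errors, HTTP errors, and timeouts.
        return None
```

Returning None on any failure means the UDF yields null for that row rather than raising inside the Spark job.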
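A uj5u.com user replied:
I agree with @SCouto about the UDF, but I suggest returning a tuple instead of a comma-separated string. That saves the two extra split transformations later.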
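def get_latitude_longitude(address):
    # Call your API with the address as the parameter. call_geocoding_api is a
    # placeholder name for that call (not a real function); it is assumed to
    # return a (latitude, longitude) pair.
    lat, lon = call_geocoding_api(address)
    # Return floats so the values match the DoubleType element type.
    return (float(lat), float(lon))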
from pyspark.sql import functions as F
from pyspark.sql import types as T
get_latitude_longitude_UDF = F.udf(get_latitude_longitude, T.ArrayType(T.DoubleType()))
df = (df
    .withColumn('latlon', get_latitude_longitude_UDF('Address'))
    # Use F.col('latlon'), not df['latlon']: the original df has no 'latlon' column.
    .withColumn('lat', F.col('latlon')[0])
    .withColumn('lon', F.col('latlon')[1])
)
