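I need to extract the HOST from millions of URLs. Some of the URLs are malformed, and parse_url returns NULL for them. In many cases the problem is caused by curly braces ({}) or a pipe (|); in other cases it is caused by more than one hash (#) character.
Here is my code, with examples of the URLs I need to parse: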
val b = Seq(
("https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}"),
("https://example.com/test.aspx?names=John|Peter"),
("https://example.com/#/test.aspx?help=John#top"),
("https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12"),
).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT parse_url(`url_col`, 'HOST') as HOST, url_col from temp").show(false)
Expected output:
+-----------+------------------------------------------------------------------------+
|HOST       |url_col                                                                 |
+-----------+------------------------------------------------------------------------+
|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|example.com|https://example.com/test.aspx?names=John|Peter                          |
|example.com|https://example.com/#/test.aspx?help=John#top                           |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
+-----------+------------------------------------------------------------------------+
Current output:
+-----------+------------------------------------------------------------------------+
|HOST       |url_col                                                                 |
+-----------+------------------------------------------------------------------------+
|null       |https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|null       |https://example.com/test.aspx?names=John|Peter                          |
|null       |https://example.com/#/test.aspx?help=John#top                           |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
+-----------+------------------------------------------------------------------------+
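Is there a way to force parse_url to return the host when the URL contains invalid characters or is malformed? Or is there a better approach?
Answer:
You can use the regexp_extract function to pull the domain out with a regular expression: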
spark.sql("""
SELECT regexp_extract(url_col, "^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www.)?([^:\/\n?]+)", 1) as HOST,
url_col
FROM temp
""").show(false)
//+-----------+------------------------------------------------------------------------+
//|HOST       |url_col                                                                 |
//+-----------+------------------------------------------------------------------------+
//|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
//|example.com|https://example.com/test.aspx?names=John|Peter                          |
//|example.com|https://example.com/#/test.aspx?help=John#top                           |
//|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12  |
//+-----------+------------------------------------------------------------------------+
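To see why the regex approach succeeds where parse_url gives up, it can help to compare the two outside Spark. As far as I know, parse_url delegates to java.net.URI and returns null when that class rejects the string; treat that as an assumption about Spark's internals rather than something stated in the question. The sketch below is plain Scala. The names HostExtraction, hostViaUri and hostViaRegex are made up for illustration, and the pattern is the one from the query with the dot in "www." escaped.

import java.net.URI
import scala.util.Try

object HostExtraction {
  // Same idea as the pattern in the query above, written as a plain Scala
  // regex, so no SQL string-literal escaping is involved.
  private val HostPattern = """^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n?]+)""".r

  // Strict parse: None when java.net.URI rejects the string or has no host.
  def hostViaUri(url: String): Option[String] =
    Try(new URI(url)).toOption.flatMap(u => Option(u.getHost))

  // Lenient parse: first capture group of the regex, if the pattern matches.
  def hostViaRegex(url: String): Option[String] =
    HostPattern.findFirstMatchIn(url).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}",
      "https://example.com/test.aspx?names=John|Peter",
      "https://example.com/#/test.aspx?help=John#top",
      "https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12"
    )
    // Print both results side by side for each sample URL.
    urls.foreach { u =>
      println(s"uri=${hostViaUri(u)} regex=${hostViaRegex(u)} url=$u")
    }
  }
}

One difference to keep in mind: the regex drops a leading "www.", so "https://www.example.com/" yields "example.com" here, while a strict URI parse reports "www.example.com". If both kinds of host need to compare equal downstream, pick one convention and apply it to every row.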
