I'm trying to extract the domain from a URL.
Input:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()
Expected result:
+--------------------------------+---------------+
| raw_url                        | host          |
+--------------------------------+---------------+
| subdomain.example.com/test.php | example.com   |
| example.com                    | example.com   |
| example.buzz                   | example.buzz  |
| test.example.buzz              | example.buzz  |
| subdomain.example.co.uk        | example.co.uk |
+--------------------------------+---------------+
Any suggestions are greatly appreciated.
Edit: Following a hint from @AlexOtt, I've gotten a few steps closer.
import com.google.common.net.InternetDomainName
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
However, I'm clearly not wiring it into withColumn correctly. This is the error:
error: not found: value topPrivateDomain
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
Edit 2:
I got some great pointers from @sarveshseri, and after cleaning up a few syntax errors, the code below is able to strip the subdomain from most URLs.
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName
import java.net.URL
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
//("example.buzz"),
//("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val hostExtractUdf = org.apache.spark.sql.functions.udf {
(urlString: String) =>
val url = new URL("https://" + urlString)
val host = url.getHost
InternetDomainName.from(host).topPrivateDomain().name()
}
var c = b.select("raw_url").withColumn("HOST",
hostExtractUdf(col("raw_url")))
.show(false)
However, it still doesn't work as expected. Newer suffixes such as .buzz, .site and .today cause the following error:
Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
A uj5u.com user replied:
First, you need to add guava as a dependency in build.sbt.
libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"
Now you can extract the host as follows:
import com.google.common.net.InternetDomainName
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import java.net.URL
import spark.implicits._
// Prepend a scheme so java.net.URL can parse scheme-less input, then
// reduce the host to its top private domain with Guava.
val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
  val url = new URL("https://" + urlString)
  val host = url.getHost
  InternetDomainName.from(host).topPrivateDomain().toString
}
val b = sc.parallelize(Seq(
("a.b.com/c.php"),
("a.b.site/c.php"),
("a.b.buzz/c.php"),
("a.b.today/c.php"),
("b.com"),
("b.site"),
("b.buzz"),
("b.today"),
("a.b.buzz"),
("a.b.co.uk"),
("a.b.site")
)).toDF("raw_url")
val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))
c.show()
Output of c.show():
+---------------+-------+
|        raw_url|   HOST|
+---------------+-------+
|  a.b.com/c.php|  b.com|
| a.b.site/c.php| b.site|
| a.b.buzz/c.php| b.buzz|
|a.b.today/c.php|b.today|
|          b.com|  b.com|
|         b.site| b.site|
|         b.buzz| b.buzz|
|        b.today|b.today|
|       a.b.buzz| b.buzz|
|      a.b.co.uk|b.co.uk|
|       a.b.site| b.site|
+---------------+-------+
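If some rows may still contain hosts that Guava cannot place under a known public suffix (the situation behind the IllegalStateException in the question), a more defensive variant might be worth considering. The following is only a minimal sketch: the object name SafeHostExtract and the helper topPrivateDomainOf are made up for illustration, and it assumes Guava is on the classpath. It checks isUnderPublicSuffix before calling topPrivateDomain and returns None instead of throwing.

```scala
import com.google.common.net.InternetDomainName
import java.net.URL
import scala.util.Try

// Hypothetical helper, not part of Spark or Guava.
object SafeHostExtract {
  // Returns the top private domain, or None when the input is null,
  // cannot be parsed, or is not under a public suffix known to Guava.
  def topPrivateDomainOf(urlString: String): Option[String] =
    Option(urlString).flatMap { raw =>
      Try {
        // Same trick as above: prepend a scheme so java.net.URL accepts the input.
        val host = new URL("https://" + raw).getHost
        val name = InternetDomainName.from(host)
        if (name.isUnderPublicSuffix) Some(name.topPrivateDomain().toString)
        else None
      }.toOption.flatten
    }
}
```

In Spark this could presumably be wrapped as udf(SafeHostExtract.topPrivateDomainOf _), so that rows which cannot be resolved come back as null rather than failing the whole job.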
A uj5u.com user replied:
Alternatively, you could use regular expressions with Spark's built-in regexp_extract and regexp_replace functions. Here is an example:
val c = b.withColumn(
"HOST",
regexp_extract(col("raw_url"), raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)", 1)
).withColumn(
"sub_domain",
regexp_extract(col("HOST"), raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*", 1)
).withColumn(
"HOST",
expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
).drop("sub_domain")
c.show(false)
//+-----------------------------------+-------------+
//|raw_url                            |HOST         |
//+-----------------------------------+-------------+
//|subdomain.example.com/test.php     |example.com  |
//|example.com                        |example.com  |
//|example.buzz                       |example.buzz |
//|test.example.buzz                  |example.buzz |
//|https://www.subdomain.example.co.uk|example.co.uk|
//|subdomain.domain.buzz              |domain.buzz  |
//|dev.example.today                  |example.today|
//+-----------------------------------+-------------+
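The first regexp_extract pulls the full host name (including any subdomain) out of the URL. Then, using a regex taken from another answer, we find the subdomain and replace it with an empty string.
I haven't tested this against every possible case, but it works for the examples given in your question.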
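To probe the patterns on edge cases without starting a Spark session, the same two steps can be imitated in plain Scala. This is only a rough sketch: the object name RegexDomainSketch and the function registrableDomain are invented for illustration, the patterns are written with their + quantifiers and an escaped dot after www, and the helper merely approximates what regexp_extract and regexp_replace do.

```scala
import scala.util.matching.Regex

// Hypothetical stand-alone imitation of the two regex steps above.
object RegexDomainSketch {
  // Host pattern: optional scheme, optional user-info, optional "www.".
  private val hostRe: Regex =
    """^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)""".r
  // Subdomain pattern: everything before a dot that is still followed by "name.suffix".
  private val subRe: Regex =
    """(.*?)\.(?=[^\/]*\..{2,5})/?.*""".r

  // Like regexp_extract: group 1 of the first match, or "" when nothing matches.
  private def extract(re: Regex, s: String): String =
    re.findFirstMatchIn(s).flatMap(m => Option(m.group(1))).getOrElse("")

  def registrableDomain(rawUrl: String): String = {
    val host = extract(hostRe, rawUrl)
    val sub  = extract(subRe, host)
    // Like trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, '')).
    host.replaceAll(sub, "").dropWhile(_ == '.')
  }
}
```

With this sketch, subdomain.example.com/test.php gives example.com, but a bare example.co.uk gives co.uk, because the pattern cannot tell a two-part suffix from a subdomain. Inputs like that are where a public-suffix-aware approach is likely the safer choice.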
