在 MLLIB 管道中,如何在 Stemmer(來自 Spark NLP)之后鏈接 CountVectorizer(來自 SparkML)?
當我嘗試在管道中同時使用兩者時,我得到:
myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
問候,
uj5u.com熱心網友回復:
您需要在 Spark NLP 管道中添加一個 Finisher。試試看:
val documentAssembler =
new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector =
new SentenceDetector().setInputCols("document").setOutputCol("sentences")
val tokenizer =
new Tokenizer().setInputCols("sentences").setOutputCol("token")
val stemmer = new Stemmer()
.setInputCols("token")
.setOutputCol("stem")
val finisher = new Finisher()
.setInputCols("stem")
.setOutputCols("token_features")
.setOutputAsArray(true)
.setCleanAnnotations(false)
val cv = new CountVectorizer()
.setInputCol("token_features")
.setOutputCol("features")
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
sentenceDetector,
tokenizer,
stemmer,
finisher,
cv
))
val data =
Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val model = pipeline.fit(data)
val df = model.transform(data)
輸出:
--------------------------------------------------------------------
|features |
--------------------------------------------------------------------
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
--------------------------------------------------------------------
轉載請註明出處,本文鏈接:https://www.uj5u.com/ruanti/311454.html
