I have downloaded the AMiner DBLP Version 11 article corpus. The corpus is one huge text file (12 GB) in which every line is a self-contained JSON string:
'{"id": "100001334", "title": "Ontologies in HYDRA - Middleware for Ambient Intelligent Devices.", "authors": [{"name": "Peter Kostelnik", "id": "2702511795"}, {"name": "Martin Sarnovsky", "id": "2041014688"}, {"name": "Jan Hreno", "id": "2398560122"}], "venue": {"raw": "AMIF"}, "year": 2009, "n_citation": 2, "page_start": "43", "page_end": "46", "doc_type": "", "publisher": "", "volume": "", "issue": "", "fos": [{"name": "Lernaean Hydra", "w": 0.4178039}, {"name": "Database", "w": 0.4269269}, {"name": "World Wide Web", "w": 0.415332377}, {"name": "Ontology (information science)", "w": 0.459045082}, {"name": "Computer science", "w": 0.399807781}, {"name": "Middleware", "w": 0.5905041}, {"name": "Ambient intelligence", "w": 0.5440575}]}'
All JSON strings are separated by newlines.
When I open the file with PySpark, it returns a DataFrame with a single column containing the JSON strings:
df = spark.read.text(path_to_data)
df.show()
+--------------------+
|               value|
+--------------------+
|{"id": "100001334...|
|{"id": "100001888...|
|{"id": "100002270...|
|{"id": "100004108...|
|{"id": "10000571"...|
|{"id": "100007563...|
|{"id": "100008278...|
|{"id": "100008490...|
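I need to access the JSON fields to build my deep learning model.
My first attempt was to open the file with the JSON reader, as suggested in a related question: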
df = spark.read.option("wholeFile", True).option("mode", "PERMISSIVE").json(path_to_data)
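However, every solution proposed there ran for a very long time (more than 3 hours) without showing any result.
My second attempt was to parse the JSON strings into JSON objects, to get a DataFrame with the following columns: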
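from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType
df = spark.read.text(path_to_data)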
schema = StructType([StructField("id", StringType()), StructField("title", StringType()), StructField("authors", ArrayType(MapType(StringType(), StringType()))), StructField("venue", MapType(StringType(), StringType()), True), StructField("year", IntegerType(), True), StructField("keywords", ArrayType(StringType()), True), StructField("references", ArrayType(StringType()), True), StructField("n_citation", IntegerType(), True), StructField("page_start", StringType(), True), StructField("page_end", StringType(), True), StructField("doc_type", StringType(), True), StructField("lang", StringType(), True), StructField("publisher", StringType(), True), StructField("volume", StringType(), True), StructField("issue", StringType(), True), StructField("issn", StringType(), True), StructField("isbn", StringType(), True), StructField("doi", StringType(), True), StructField("pdf", StringType(), True), StructField("url", ArrayType(StringType()), True),
StructField("abstract", StringType(), True), StructField("indexed_abstract", StringType(), True)])
datajson = df.withColumn("jsonData", from_json(col("value"),schema)).select("jsonData.*")
But it raised the exception "cannot resolve column due to data type mismatch PySpark", even though the data type of every field in the schema is correct (according to the official website of the corpus).
My third attempt was to parse the JSON strings into a Map data type:
casted = df.withColumn("value", from_json(df.value, MapType(StringType(),StringType())))
It gave me the following result:
root
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
+--------------------+
|               value|
+--------------------+
|{id -> 100001334,...|
|{id -> 1000018889...|
|{id -> 1000022707...|
|{id -> 100004108,...|
|{id -> 10000571, ...|
|{id -> 100007563,...|
|{id -> 100008278,...|
Now each row holds a parsed JSON object whose fields can be accessed as follows:
row = casted.first()
row.value['id']
row.value['title']
row.value['authors']
My question is: how can I convert this DataFrame, which has a single column named 'value', into a DataFrame with the columns mentioned above (id, title, authors, etc.), based on the JSON objects?
Answer from a uj5u.com user:
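Reading the file without providing a schema takes much longer. I tried splitting the huge file into smaller chunks to infer the schema, but that failed because duplicate columns were found in the data schema.
Using the schema you provided, I tried the following on the same dataset, and it worked.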
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType
schema = StructType([StructField("id", StringType()), StructField("title", StringType()), StructField("authors", ArrayType(MapType(StringType(), StringType()))), StructField("venue", MapType(StringType(), StringType()), True), StructField("year", IntegerType(), True), StructField("keywords", ArrayType(StringType()), True), StructField("references", ArrayType(StringType()), True), StructField("n_citation", IntegerType(), True), StructField("page_start", StringType(), True), StructField("page_end", StringType(), True), StructField("doc_type", StringType(), True), StructField("lang", StringType(), True), StructField("publisher", StringType(), True), StructField("volume", StringType(), True), StructField("issue", StringType(), True), StructField("issn", StringType(), True), StructField("isbn", StringType(), True), StructField("doi", StringType(), True), StructField("pdf", StringType(), True), StructField("url", ArrayType(StringType()), True),
StructField("abstract", StringType(), True), StructField("indexed_abstract", StringType(), True)])
df = spark.read.option("wholeFile", True).option("mode", "PERMISSIVE").schema(schema).json("dblp_papers_v11.txt")
df.show()
Output:
+----------+--------------------+--------------------+--------------------+----+--------+--------------------+----------+----------+--------+----------+----+--------------------+------+-----+----+----+--------------------+----+----+--------+--------------------+
|        id|               title|             authors|               venue|year|keywords|          references|n_citation|page_start|page_end|  doc_type|lang|           publisher|volume|issue|issn|isbn|                 doi| pdf| url|abstract|    indexed_abstract|
+----------+--------------------+--------------------+--------------------+----+--------+--------------------+----------+----------+--------+----------+----+--------------------+------+-----+----+----+--------------------+----+----+--------+--------------------+
| 100001334|Ontologies in HYD...|[{name -> Peter K...|       {raw -> AMIF}|2009|    null|                null|         2|        43|      46|          |null|                    |      |     |null|null|                null|null|null|    null|                null|
|1000018889|Remote Policy Enf...|[{name -> Fabio M...|{raw -> internati...|2013|    null|[94181602, 150466...|         2|        70|      84|Conference|null|      Springer, Cham|      |     |null|null|10.1007/978-3-319...|null|null|    null|{"IndexLength":17...|
|1000022707|A SIMPLE OBSERVAT...|[{name -> Jerzy M...|{raw -> Reports o...|2009|    null|[1972178849, 2069...|         0|        19|      29|   Journal|null|                    |    44|     |null|null|                null|null|null|    null|{"IndexLength":49...|
| 100004108|Gait based human ...|[{name -> Emdad H...|{raw -> internati...|2012|    null|[1578000111, 2120...|         0|       319|     328|Conference|null|Springer, Berlin,...|      |     |null|null|10.1007/978-3-642...|null|null|    null|{"IndexLength":82...|
|  10000571|The GAME Algorith...|[{name -> Pavel K...|{raw -> internati...|2008|    null|[291899685, 19641...|         5|       859|     868|Conference|null|Springer, Berlin,...|      |     |null|null|10.1007/978-3-540...|null|null|    null|{"IndexLength":17...|
| 100007563|Formal Verificati...|[{name -> George ...|{raw -> Software ...|2006|    null|[1578963809, 1612...|         1|       650|     656|   Journal|null|                    |      |     |null|null|                null|null|null|    null|{"IndexLength":87...|
| 100008278|EMOTIONAL AND RAT...|[{name -> Colin G...|{raw -> internati...|2010|    null|[116282327, 14967...|         2|       238|        |Conference|null|                    |      |     |null|null|                null|null|null|    null|{"IndexLength":12...|
| 100008490|Principle-Based P...|[{name -> Sandiwa...|{raw -> Natural L...|1991|    null|                null|         3|        43|      60|          |null|                    |      |     |null|null|                null|null|null|    null|                null|
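As a side note, one way to get a feel for the fields before writing a schema by hand is to sample a few lines in plain Python instead of letting Spark infer the schema over the whole file. The following is only a minimal sketch, assuming one JSON object per line as described in the question. The helper name `sample_field_types`, the `limit` parameter, and the two sample records are made up for illustration and are not part of the original answer.

```python
import json
from collections import defaultdict
from itertools import islice


def sample_field_types(lines, limit=1000):
    """Map each top-level JSON key to the set of Python type names seen.

    `lines` can be any iterable of strings, for example an open file handle,
    so only the first `limit` lines are ever read.
    """
    seen = defaultdict(set)
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


# Two made-up records in the same one-object-per-line layout (not real corpus data).
sample = [
    '{"id": "1", "year": 2009, "authors": [{"name": "A"}], "venue": {"raw": "X"}}',
    '{"id": "2", "year": 2013, "references": ["10", "11"]}',
]
field_types = sample_field_types(sample)
```

With the real corpus, the same function could be given an open file handle (for example from `open("dblp_papers_v11.txt", encoding="utf-8")`), which is iterated lazily, so the 12 GB file is not loaded into memory. Keys that appear in only some records, such as `references` in the sample, are the ones that should be nullable in the schema.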
