How to convert strings in a 2D RDD to int using PySpark

I'm new to PySpark and have been trying to get this working for several hours.

Currently, my RDD looks like this:

[['74', '85', '123'], ['73', '84', '122'], ['72', '83', '121'], ['70', '81', '119'], ['70', '81', '119'], ['69', '80', '118'], ['70', '81', '119'], ['70', '81', '119'], ['76', '87', '125'], ['76', '87', '125']]

I would like it to look like this (all entries as integers):

[[74, 85, 123], [73, 84, 122], [72, 83, 121], [70, 81, 119], [70, 81, 119], [69, 80, 118], [70, 81, 119], [70, 81, 119], [76, 87, 125], [76, 87, 125]]

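The closest I've gotten is using flatMap to flatten it into a one-dimensional array and then converting the entries to integers. However, I want to process the integers three at a time (computing the sum and the average of each group of 3 entries), and I think keeping the data in a 2D array is the simplest way to do that. I also tried list comprehensions, but they don't seem to work because the RDD isn't a list. Any help would be greatly appreciated!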

Reply from a uj5u.com user:

Update

Before performing the operations below, you can use map and collect to convert your RDD into a list.

# Assumes `spark` is an active SparkSession and `data` is the nested list of strings shown above
rdd = spark.sparkContext.parallelize(data)
list_string = rdd.map(list).collect()

A list comprehension is quick and efficient enough to convert all your strings to integers. With a little more practice, it will become second nature.

list_value = [[int(i) for i in list_] for list_ in list_string]

print(list_value)
[[74, 85, 123], [73, 84, 122], [72, 83, 121], [70, 81, 119], [70, 81, 119], [69, 80, 118], [70, 81, 119], [70, 81, 119], [76, 87, 125], [76, 87, 125]]
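If the data should stay in the RDD rather than being collected to the driver first, the same conversion can be applied row by row with map. The following is a minimal sketch under assumptions that are not part of the original answer: the helper names to_int_row and convert_on_spark are hypothetical, and a local SparkSession is created only for illustration.

```python
def to_int_row(row):
    """Convert one row of numeric strings to a list of ints."""
    return [int(value) for value in row]


def convert_on_spark(data):
    """Convert every row inside the RDD (assumes pyspark is installed)."""
    # Imported here so that to_int_row stays usable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    try:
        rdd = spark.sparkContext.parallelize(data)
        # map applies to_int_row to each inner list, so the 2D shape is kept.
        return rdd.map(to_int_row).collect()
    finally:
        spark.stop()


sample = [['74', '85', '123'], ['73', '84', '122']]
# The per-row function can be checked locally, without a cluster.
local_result = [to_int_row(row) for row in sample]
print(local_result)
```

Calling convert_on_spark(sample) should give the same nested list, but it needs a working Spark installation, so the sketch does not run it automatically.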

The same goes for summing and averaging over a 2D list.

list_sum = [sum(vector) for vector in list_value]
list_avg = [sum(vector)/len(vector) for vector in list_value]
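Since the question asks for both the sum and the average of each group of three, the two values can also be computed together in one pass per row. A small sketch, assuming every row is non-empty; row_sum_and_mean is a hypothetical helper name, and the same function could be passed to rdd.map if the rows are kept in an RDD.

```python
def row_sum_and_mean(row):
    """Return (sum, mean) for one non-empty row of integers."""
    total = sum(row)
    return total, total / len(row)


list_value = [[74, 85, 123], [73, 84, 122], [72, 83, 121]]
stats = [row_sum_and_mean(row) for row in list_value]
print(stats)  # [(282, 94.0), (279, 93.0), (276, 92.0)]
```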

Or better, just use NumPy to do the trick.

import numpy as np
array = np.array(list_value)

np.sum(array, axis=1)
Out[174]: array([282, 279, 276, 270, 270, 267, 270, 270, 288, 288])

np.average(array, axis=1)
Out[175]: array([94., 93., 92., 90., 90., 89., 90., 90., 96., 96.])
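NumPy can also do the string-to-integer conversion itself, so the intermediate list comprehension is optional. A minimal sketch, assuming the rows have already been collected into a list of string lists and that every row has the same length:

```python
import numpy as np

list_string = [['74', '85', '123'], ['73', '84', '122'], ['72', '83', '121']]

# Build a string array first, then cast every element to int.
array = np.array(list_string).astype(int)

row_sums = array.sum(axis=1)    # one sum per row
row_means = array.mean(axis=1)  # one mean per row

print(row_sums.tolist())
print(row_means.tolist())
```

Rows of unequal length would not form a regular 2D array, in which case the plain list approach is the safer choice.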

For a speed comparison, I created a list and an array of shape (1000, 3). I hope this gives you some insight into their relative efficiency.

%timeit np.sum(array, axis=1)
%timeit [sum(vector) for vector in list_value]

20.3 μs ± 412 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
167 μs ± 3 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.average(array, axis=1)
%timeit [sum(vector)/len(vector) for vector in list_value]

29.3 μs ± 536 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256 μs ± 23.1 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
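Note that %timeit is an IPython magic, so the commands above only run in IPython or Jupyter. In a plain Python script, the standard timeit module gives a comparable measurement. A rough sketch, assuming random test data (the original does not say how the (1000, 3) data was generated) and an arbitrary repetition count; absolute timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

random.seed(0)
# 1000 rows of 3 integers, matching the (1000, 3) shape used in the comparison.
list_value = [[random.randint(0, 255) for _ in range(3)] for _ in range(1000)]
array = np.array(list_value)

n = 200  # repetitions per measurement; an arbitrary choice for this sketch
t_numpy = timeit.timeit(lambda: np.sum(array, axis=1), number=n) / n
t_list = timeit.timeit(lambda: [sum(v) for v in list_value], number=n) / n

print(f"NumPy: {t_numpy * 1e6:.1f} µs per call")
print(f"list:  {t_list * 1e6:.1f} µs per call")
```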

For small 2D lists, however, the list comprehension is faster than using a NumPy array.

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/477864.html

Tags: Python, Apache Spark, pyspark