我試圖按照本教程構建一個機器學習,該教程在 iris 資料集上運行良好,但是,當我嘗試使用我自己的 CSV(用于專案)時,它給了我一個錯誤。當我嘗試使用不同的、不相關的方法時,同樣的事情發生了。(其余的細節在底部)這是我的代碼:
# Python version
import sys
from sklearn.metrics import make_scorer
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
# Load dataset
url = "energy.csv"
#url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:8]
y = array[:, 8]
print(y)
我的csv:
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070
當我在最后一行列印 y 時,我得到了這個:
[ 19.9948 0. 0. 0. 0. 0. 0.
0. 0. 260.2 1326.9 nan nan nan
nan nan 723.18 2070. ]
我不相信應該發生('nan'事情)。我在這方面不是很有經驗,所以任何關于正在發生的事情的方向都會非常感激,在此先感謝。
uj5u.com熱心網友回復:
并非 CSV 中的所有行都包含所有列的資料 (10)。當缺少資料時,它表示為 NaN(not-a-number 的縮寫)。
In [42]: df = pd.read_csv("/tmp/energy.csv", header=None)
In [43]: df
Out[43]:
0 1 2 3 4 5 6 7 8 9
0 18 28 564.00 0.00 6284.08 1713.84 19.9948 19994.80 19.9948 19.9948
1 17 28 411.00 0.00 6250.42 852.33 0.0000 20740.03 568.2200 0.0000
2 16 27 515.00 0.00 6053.30 550.30 0.0000 20361.10 550.3000 0.0000
3 15 24 586.00 491.72 5408.92 245.86 0.0000 17947.78 491.7200 0.0000
4 14 26 653.00 533.06 6130.19 0.00 0.0000 18923.63 1066.1200 0.0000
5 13 26 836.00 805.08 6172.28 0.00 0.0000 18785.20 1073.4400 0.0000
6 12 26 73.00 1303.65 5736.06 0.00 0.0000 17990.37 1042.9200 0.0000
7 11 27 55.00 1352.75 6222.65 0.00 0.0000 18397.40 1082.2000 0.0000
8 10 26 236.00 1311.80 6034.28 0.00 0.0000 17578.12 1311.8000 0.0000
9 9 26 20.00 1821.40 3903.00 0.00 0.0000 18994.60 1040.8000 260.2000
10 8 26 538.00 0.00 4246.08 265.38 13799.7600 6369.12 0.0000 1326.9000
11 7 25800 3354.00 5160.00 0.00 0.00 14964.0000 1290.00 1032.0000 NaN
12 6 26682 3468.66 5603.22 0.00 0.00 14941.9200 1600.92 1067.2800 NaN
13 5 24997 3499.58 5499.34 0.00 0.00 13248.4100 1499.82 1249.8500 NaN
14 4 25100 3765.00 4769.00 0.00 0.00 13052.0000 1506.00 2008.0000 NaN
15 3 24651 4190.67 4930.20 0.00 0.00 12325.5000 1232.55 1972.0800 NaN
16 2 12 53.00 0.00 1084.77 0.00 3133.7800 6508.62 0.0000 723.1800
17 1 11 500.00 2070.00 2415.00 0.00 0.0000 4255.00 690.0000 2070.0000
見第 11-15 行。你可以在這個FAQ中找到更多關于 Pandas 中 NaN 的資訊。
編輯: 也許很難看到整個資料框。下面我只顯示第 10 列和第 5 列。
In [57]: df.iloc[10:, 5:]
Out[57]:
5 6 7 8 9
10 265.38 13799.76 6369.12 0.00 1326.90
11 0.00 14964.00 1290.00 1032.00 NaN
12 0.00 14941.92 1600.92 1067.28 NaN
13 0.00 13248.41 1499.82 1249.85 NaN
14 0.00 13052.00 1506.00 2008.00 NaN
15 0.00 12325.50 1232.55 1972.08 NaN
16 0.00 3133.78 6508.62 0.00 723.18
17 0.00 0.00 4255.00 690.00 2070.00
In [58]: df.iloc[11:16, 9]
Out[58]:
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
Name: 9, dtype: float64
In [59]: df.iloc[11:16, 9].isna()
Out[59]:
11 True
12 True
13 True
14 True
15 True
Name: 9, dtype: bool
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/520949.html
