Reading data from a file using numpy `fromregex`

I need to read data from several text files that begin with an arbitrary number of lines of text. A typical file looks like this:

file1.dat:

The file contains data
# this is a comment skip me
DataStart
  index = integer
Some text

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
 1.5 4.0 1.87
 1.7 -4.67 0.124
 ...
 ...
 15.3 -3.5e02 1.775

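file1.dat may contain several lines of text at the top, and these lines may start with spaces, tabs, etc.
The data block I am interested in always comes below these lines and has a fixed number of columns. In this case it has 3 columns: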

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
 1.5 4.0 1.87
 1.7 -4.67 0.124
 ...
 ...
 15.3 -3.5e02 1.775

Lines containing data may have spaces/tabs at the start of each line.

I tried the following code:

import numpy as np

pattern = r'^[-0-9 ]*' 
mydata = np.fromregex('file1.dat', pattern, dtype=float)

But when I run it, I get:

~/.local/lib/python3.8/site-packages/numpy/lib/npyio.py in fromregex(file, regexp, dtype, encoding)
   1530             # Create the new array as a single data-type and then
   1531             #   re-interpret as a single-field structured array.
-> 1532             newdtype = np.dtype(dtype[dtype.names[0]])
   1533             output = np.array(seq, dtype=newdtype)
   1534             output.dtype = dtype

TypeError: 'NoneType' object is not subscriptable

Any help would be much appreciated.

Answer:

I think your regular expression needs to look more like this:

pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'
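
A minimal sketch of why the capture groups matter, based on a reading of the traceback in the question rather than on anything the answer states. `np.fromregex` collects its matches with `findall`, which returns plain strings when the pattern has no group and tuples when it has several. With a plain `float` dtype, the string case appears to be what reaches the `dtype.names[0]` line, where `dtype.names` is `None`. The in-memory `text` below is a shortened, hypothetical stand-in for file1.dat, and only the standard `re` module is used.

```python
import re

# Shortened stand-in for file1.dat (hypothetical sample, not the real file).
text = """The file contains data
# this is a comment skip me
DataStart
  index = integer
Some text

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 1.0 0.1 3.0
"""

# Pattern from the question: no capture group, anchored at the start of the text.
no_groups = re.findall(r'^[-0-9 ]*', text)

# Pattern with three capture groups, one per column.
pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'
rows = re.findall(pattern, text)

# Each match is a tuple of three strings that can be converted to floats.
values = [[float(tok) for tok in row] for row in rows]

print(no_groups)  # a list of plain strings, not tuples
print(values)
```

Note that the character class `[-+0-9e.]` is permissive: it also accepts tokens such as `e` or `..`, so header lines made of such tokens would slip through.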

Answer:

In [603]: txt="""-5.0e-2 3.3 4.0
     ...:  0 0.0e0 0.0e0
     ...:  1.0 0.1 3.0
     ...:  1.5 4.0 1.87
     ...:  1.7 -4.67 0.124
     ...:  15.3 -3.5e02 1.775"""

The number layout looks regular enough for the standard csv readers:

In [604]: np.genfromtxt(txt.splitlines())
Out[604]: 
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])

Or even a plain line split:

In [605]: alist=[]
     ...: for line in txt.splitlines():
     ...:     alist.append(line.split())
     ...: 
In [606]: alist
Out[606]: 
[['-5.0e-2', '3.3', '4.0'],
 ['0', '0.0e0', '0.0e0'],
 ['1.0', '0.1', '3.0'],
 ['1.5', '4.0', '1.87'],
 ['1.7', '-4.67', '0.124'],
 ['15.3', '-3.5e02', '1.775']]
In [607]: np.array(alist, float)
Out[607]: 
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])
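
The session above starts from the data block only. One possible way to extend the line-split idea to the full file, header lines included, is to keep only the lines whose tokens all parse as floats. This is a sketch under assumptions: the helper name `read_numeric_rows` and the `sample` text are hypothetical, and a header line that happens to consist of exactly three numbers would be kept as data.

```python
def read_numeric_rows(lines, ncols=3):
    """Return rows of floats from lines that hold exactly ncols numeric tokens."""
    rows = []
    for line in lines:
        tokens = line.split()  # split() also drops leading spaces and tabs
        if len(tokens) != ncols:
            continue  # wrong column count: header, comment, or blank line
        try:
            rows.append([float(tok) for tok in tokens])
        except ValueError:
            continue  # right column count but not numeric, e.g. "index = integer"
    return rows


# Shortened stand-in for file1.dat (hypothetical sample).
sample = """The file contains data
# this is a comment skip me
DataStart
  index = integer
Some text

 -5.0e-2 3.3 4.0
 0 0.0e0 0.0e0
 15.3 -3.5e02 1.775
"""

rows = read_numeric_rows(sample.splitlines())
print(rows)
```

For a real file, the open file object can be passed directly, since iterating over it yields lines, and the resulting list can then go through `np.array(rows, float)` as in the session above.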

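Answer:

To match floating point numbers, we can use the following regular expression (see this answer for details):

[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?

You need to wrap it in a group `()` to extract the tokens from each line: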

import numpy as np

# zero or more whitespace characters
opt_whitespace = r'\s*'

# the number token, wrapped in a capture group
number = r'([+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?)'

# one or more whitespace characters
whitespace = r'\s+'

# number of data columns
N = 3

# the regex: N number groups separated by whitespace, ending at a newline
pattern = opt_whitespace + number + (whitespace + number)*(N-1) + opt_whitespace + r'\n'

data = np.fromregex('file1.dat', pattern, dtype=float)
print(data)

Output:

[[-5.000e-02  3.300e+00  4.000e+00]
 [ 0.000e+00  0.000e+00  0.000e+00]
 [ 1.000e+00  1.000e-01  3.000e+00]
 [ 1.500e+00  4.000e+00  1.870e+00]
 [ 1.700e+00 -4.670e+00  1.240e-01]
 [ 1.530e+01 -3.500e+02  1.775e+00]]
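
A possible variation on the pattern above, offered as a sketch and not as part of the original answer: it requires a trailing `\n`, so a last data line without a final newline would be missed, and `\s` can also match across line breaks. Anchoring each row with `^` and `$` under `re.MULTILINE` and limiting the separators to spaces and tabs avoids both. The names `gap` and `row_pattern` and the sample `text` are hypothetical. The check below uses only the standard `re` module.

```python
import re

N = 3  # number of data columns

# Same float token as above, wrapped in a capture group.
number = r'([+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?)'

# Separators restricted to spaces and tabs so a match never spans two lines.
gap = r'[ \t]+'

# One full line: optional indent, N numbers, optional trailing blanks.
row_pattern = re.compile(
    r'^[ \t]*' + number + (gap + number) * (N - 1) + r'[ \t]*$',
    re.MULTILINE,
)

# Hypothetical sample whose last line has no trailing newline.
text = "Some text\n  index = integer\n\n -5.0e-2 3.3 4.0\n 15.3 -3.5e02 1.775"

rows = row_pattern.findall(text)
values = [[float(tok) for tok in row] for row in rows]
print(values)
```

`np.fromregex` is documented to accept a compiled pattern as well as a string, so `row_pattern` should be usable in place of `pattern` in the call above, though that combination was not run here.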
