Python DataFrame from a text file

I have a text file in the following format:

//DATASET

..... unnecessary lines.....

TIMEUNITS SECONDS

    TS 0  1.98849600e+08
        3.30000000e-03    1.25400000e-02    5.88000000e-03    0.00000000e+00    0.00000000e+00
        5.88000000e-03    3.33000000e-03    2.16000000e-03    0.00000000e+00    0.00000000e+00
    TS 0  1.98853209e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
        1.25400000e-02    5.88000000e-03    3.33000000e-03    0.00000000e+00    0.00000000e+00
    TS 0  1.98860419e+08
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e+00    0.00000000e+00
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
    TS 0  1.98864081e+08
        1.08000000e-03    8.70000000e-04    7.20000000e-04    0.00000000e+00    0.00000000e+00
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
    TS 0  1.98867619e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e+00    0.00000000e+00

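I have also attached a sample text file named "D50.bc" at this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing

There are no column names, but they can be added, i.e. V1, V2, and so on. In the real text file there are 14 columns and 1000 rows after each TS. Each row corresponds to a node, and the columns hold values for that node (velocity, shear stress, and so on).

I want to extract all the data under each "TS 0 XXX" line into individual DataFrames, grouped by timestamp (TS), so that I can do column operations for each TS. The TS value XXXX from the header line can be added to the DataFrames as a separate column. This is the output pattern I want in a pandas DataFrame: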

#          V1      V2      V3 V4 V5 grp node        TS                      TS0
# 1.1 0.00330 0.01254 0.00588  0  0   1    1 198849600     TS 0  1.98849600e+08
# 1.2 0.00588 0.00333 0.00216  0  0   1    2 198849600     TS 0  1.98849600e+08
# 2.1 0.00000 0.00000 0.00000  0  0   2    1 198853209     TS 0  1.98853209e+08
# 2.2 0.01254 0.00588 0.00333  0  0   2    2 198853209     TS 0  1.98853209e+08
# 3.1 0.00333 0.00216 0.00108  0  0   3    1 198860419     TS 0  1.98860419e+08
# 3.2 0.00000 0.00000 0.00000  0  0   3    2 198860419     TS 0  1.98860419e+08
# 4.1 0.00108 0.00087 0.00072  0  0   4    1 198864081     TS 0  1.98864081e+08
# 4.2 0.00000 0.00000 0.00000  0  0   4    2 198864081     TS 0  1.98864081e+08
# 5.1 0.00000 0.00000 0.00000  0  0   5    1 198867619     TS 0  1.98867619e+08
# 5.2 0.00333 0.00216 0.00108  0  0   5    2 198867619     TS 0  1.98867619e+08

This is R code that solves the problem, but I want Python code that does the same thing.

spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
  out <- read.table(text = S[-1], header = FALSE)
  out$grp <- grp
  out$node <- seq_len(nrow(out))
  TS <- trimws(strsplit(S[1], "\\s+")[[1]])
  out$TS <- as.numeric(TS[length(TS)])
  out$TS0 <- S[1]
  out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)
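
For comparison, the R logic above maps fairly directly onto Python. The following is only a sketch under stated assumptions: the function name `parse_ts_blocks` and the shortened three-column inline sample are made up for illustration and are not from the original post. It splits the lines at each "TS 0" header, reads each block with pandas, and adds the grp, node, TS and TS0 columns.

```python
import io
import re

import pandas as pd


def parse_ts_blocks(lines):
    """Split lines at each 'TS 0 ...' header and build one combined DataFrame.

    Hypothetical helper for illustration; assumes every header is followed
    by at least one data row.
    """
    header = re.compile(r"^\s*TS 0 ")
    blocks = []  # each entry: [header_line, data_line, data_line, ...]
    for line in lines:
        if header.match(line):
            blocks.append([line.strip()])
        elif blocks and line.strip():
            # Lines before the first header and blank lines are dropped
            blocks[-1].append(line.strip())

    frames = []
    for grp, (ts0, *rows) in enumerate(blocks, start=1):
        out = pd.read_csv(io.StringIO("\n".join(rows)), sep=r"\s+", header=None)
        out.columns = [f"V{i}" for i in range(1, out.shape[1] + 1)]
        out["grp"] = grp
        out["node"] = list(range(1, len(out) + 1))
        out["TS"] = float(ts0.split()[-1])  # last field of the header line
        out["TS0"] = ts0
        frames.append(out)
    return pd.concat(frames, ignore_index=True)


# Shortened stand-in for the file contents (3 value columns instead of 14)
sample = """\
TIMEUNITS SECONDS
    TS 0  1.98849600e+08
        3.30000000e-03    1.25400000e-02    5.88000000e-03
        5.88000000e-03    3.33000000e-03    2.16000000e-03
    TS 0  1.98853209e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00
        1.25400000e-02    5.88000000e-03    3.33000000e-03
"""
df = parse_ts_blocks(sample.splitlines())
print(df)
```

For a real file, passing the open file object (for example `parse_ts_blocks(open("D50.bc"))`) should work the same way, since the function only iterates over lines.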

Answer:

Here is my attempt:

import pandas as pd

#open and read file
with open('D50.bc') as f:
    text = f.read()

#create list of TS groups
data_string = [i.splitlines() for i in text.split('TIMEUNITS SECONDS\n')[1].split('TS ')[1:]]

#create nested dictionary of TS values (grp, node and V columns are numbered from 1)
data = [{'TS0': f'TS {i[0]}',
         'TS': int(float(i[0].split()[-1])),
         'grp': n1,
         'data': [{'data': {f'V{n3}': float(z) for n3, z in enumerate(x.split(), 1)}, 'node': n2}
                  for n2, x in enumerate((r for r in i[1:] if r.strip()), 1)]}
        for n1, i in enumerate(data_string, 1)]

#load to dataframe and flatten nested dict
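df = pd.DataFrame(data).explode('data').reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))  # adds 'node' plus a nested 'data' column
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))  # expands the nested dicts into V columns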

Output:

TS0	TS	V11
TS 0 1.98849600e+08	198849600	1
TS 0 1.98849600e+08	198849600	1
TS 0 1.98849600e+08	198849600	1
TS 0 1.98849600e+08	198849600	1
TS 0 1.98849600e+08	198849600	1
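
Since the question asks for one DataFrame per timestamp, a combined frame like this can be split afterwards with `groupby`. The sketch below uses a small hand-built frame as a stand-in (its values are taken from the question's sample, trimmed to two value columns); the names `dfs` and `v1_max` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the combined frame built by the code above
df = pd.DataFrame({
    "TS": [198849600, 198849600, 198853209, 198853209],
    "grp": [1, 1, 2, 2],
    "node": [1, 2, 1, 2],
    "V1": [0.00330, 0.00588, 0.00000, 0.01254],
    "V2": [0.01254, 0.00333, 0.00000, 0.00588],
})

# One DataFrame per timestamp, keyed by the TS value
dfs = {ts: g.reset_index(drop=True) for ts, g in df.groupby("TS")}

# Column operations can also be done per TS without splitting first
v1_max = df.groupby("TS")["V1"].max()

print(dfs[198853209])
print(v1_max)
```

Keeping everything in one frame and grouping on demand is usually simpler than holding many separate DataFrames, but the dictionary is there if separate objects are really needed.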

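Answer:

I wrote this with an eye toward speed and efficiency: it avoids unnecessary copying of the file and avoids holding the whole file in memory at once. For the same reason, it uses the pandas C engine to read the file. The logic ends up more complex than in the other answer.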

#!/usr/bin/env python3
import io
import pandas as pd


def parse_chunk(current_ts_chunk, index, current_ts):
    buf = io.StringIO(''.join(current_ts_chunk))
    # sep=r"\s+" splits on runs of whitespace (replaces the deprecated delim_whitespace=True)
    chunk = pd.read_csv(buf, sep=r"\s+", header=None)
    chunk['group'] = index
    chunk['node'] = chunk.index + 1
    chunk['TS'] = current_ts.split()[2]
    chunk['TS0'] = current_ts
    return chunk


with open("D50.bc", "rt") as f:
    # Advance through file until you get to TIMEUNITS SECONDS
    for line in f:
        if line.startswith("TIMEUNITS SECONDS"):
            break
    chunk_list = []
    current_ts = None
    current_ts_chunk = []
    index = 1
    for line in f:
        # lstrip() so that indented "TS" header lines are recognised too
        if line.lstrip().startswith("TS "):
            if current_ts is None:
                # First TS in file
                current_ts = line.strip()
            else:
                # Write out previous TS
                chunk = parse_chunk(current_ts_chunk, index, current_ts)
                chunk_list.append(chunk)
                # Get ready to accept current TS
                current_ts = line.strip()
                current_ts_chunk = []
                index += 1
        else:
            # This is a data line
            current_ts_chunk.append(line)
    if current_ts is not None:
        chunk = parse_chunk(current_ts_chunk, index, current_ts)
        chunk_list.append(chunk)

    df = pd.concat(chunk_list)

print(df)

This gives the following result:

         0       1        2  ...  node              TS                   TS0
0  0.00000  0.0000  0.00000  ...     1  1.98849600e+08  TS 0  1.98849600e+08
1  0.00000  0.0000  0.00000  ...     2  1.98849600e+08  TS 0  1.98849600e+08
2  0.00000  0.0000  0.00000  ...     3  1.98849600e+08  TS 0  1.98849600e+08
3  0.00000  0.0000  0.00000  ...     4  1.98849600e+08  TS 0  1.98849600e+08
4  0.00000  0.0000  0.00000  ...     5  1.98849600e+08  TS 0  1.98849600e+08
5  0.00000  0.0000  0.00000  ...     6  1.98849600e+08  TS 0  1.98849600e+08
6  0.00000  0.0000  0.00000  ...     7  1.98849600e+08  TS 0  1.98849600e+08
7  0.00000  0.0000  0.00000  ...     8  1.98849600e+08  TS 0  1.98849600e+08
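
Note that in this result the value columns are labelled 0, 1, 2, ... and TS is still a string, whereas the question asked for V1, V2, ... and a numeric TS. One possible post-processing step is sketched below; the small frame is a hypothetical stand-in with three value columns, not the real output.

```python
import numbers

import pandas as pd

# Hypothetical stand-in for the frame printed above (3 value columns instead of 14)
df = pd.DataFrame({
    0: [0.00330, 0.00588],
    1: [0.01254, 0.00333],
    2: [0.00588, 0.00216],
    "group": [1, 1],
    "node": [1, 2],
    "TS": ["1.98849600e+08", "1.98849600e+08"],
    "TS0": ["TS 0  1.98849600e+08"] * 2,
})

# Rename the integer column labels 0, 1, 2, ... to V1, V2, V3, ...
df = df.rename(
    columns=lambda c: f"V{c + 1}" if isinstance(c, numbers.Integral) else c
)

# Convert the TS strings to floats, then to integers
df["TS"] = pd.to_numeric(df["TS"]).astype("int64")

print(df)
```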
