I have a text file in the following format:
//DATASET
..... unnecessary lines.....
TIMEUNITS SECONDS
TS 0 1.98849600e+08
3.30000000e-03 1.25400000e-02 5.88000000e-03 0.00000000e+00 0.00000000e+00
5.88000000e-03 3.33000000e-03 2.16000000e-03 0.00000000e+00 0.00000000e+00
TS 0 1.98853209e+08
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
1.25400000e-02 5.88000000e-03 3.33000000e-03 0.00000000e+00 0.00000000e+00
TS 0 1.98860419e+08
3.33000000e-03 2.16000000e-03 1.08000000e-03 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
TS 0 1.98864081e+08
1.08000000e-03 8.70000000e-04 7.20000000e-04 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
TS 0 1.98867619e+08
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
3.33000000e-03 2.16000000e-03 1.08000000e-03 0.00000000e+00 0.00000000e+00
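I have also attached a sample text file named "D50.bc" at this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing
There are no column names, but they can be added, i.e. V1, V2, etc. In the real text file, each TS is followed by 14 columns and 1000 rows. Each row corresponds to a node, and the columns hold values for that node (i.e. velocity, shear stress, etc.).
I want to extract all the data under each "TS 0 XXX" line into separate dfs, keyed by timestamp (TS), so that I can do column operations for each TS. The TS value XXX from that line can be added to the dfs as a separate column. Here is the output pattern I want in a pandas DataFrame: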
#          V1      V2      V3 V4 V5 grp node        TS                 TS0
# 1.1 0.00330 0.01254 0.00588  0  0   1    1 198849600 TS 0 1.98849600e+08
# 1.2 0.00588 0.00333 0.00216  0  0   1    2 198849600 TS 0 1.98849600e+08
# 2.1 0.00000 0.00000 0.00000  0  0   2    1 198853209 TS 0 1.98853209e+08
# 2.2 0.01254 0.00588 0.00333  0  0   2    2 198853209 TS 0 1.98853209e+08
# 3.1 0.00333 0.00216 0.00108  0  0   3    1 198860419 TS 0 1.98860419e+08
# 3.2 0.00000 0.00000 0.00000  0  0   3    2 198860419 TS 0 1.98860419e+08
# 4.1 0.00108 0.00087 0.00072  0  0   4    1 198864081 TS 0 1.98864081e+08
# 4.2 0.00000 0.00000 0.00000  0  0   4    2 198864081 TS 0 1.98864081e+08
# 5.1 0.00000 0.00000 0.00000  0  0   5    1 198867619 TS 0 1.98867619e+08
# 5.2 0.00333 0.00216 0.00108  0  0   5    2 198867619 TS 0 1.98867619e+08
Here is R code that solves the problem, but I would like Python code that does the same thing.
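# txt is assumed to hold the lines of the file, e.g. txt <- readLines("D50.bc")
# Split the lines into one group per "TS 0 ..." header; drop everything before the first header
spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
  out <- read.table(text = S[-1], header = FALSE)
  out$grp <- grp
  out$node <- seq_len(nrow(out))
  TS <- trimws(strsplit(S[1], "\\s+")[[1]])
  out$TS <- as.numeric(TS[length(TS)])
  out$TS0 <- S[1]
  out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)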
Reply from a uj5u.com user:
Here is my attempt:
import pandas as pd
# Open and read the file
with open('D50.bc') as f:
    text = f.read()
# Create one list of lines per TS group (everything after the TIMEUNITS line)
data_string = [i.splitlines() for i in text.split('TIMEUNITS SECONDS\n')[1].split('TS ')[1:]]
# Create one nested dictionary per TS group; x.split() tolerates leading or repeated spaces
data = [
    {
        'TS0': f'TS {i[0]}',
        'TS': int(float(i[0].split()[-1])),
        'grp': n1,
        'data': [
            {'data': {f'V{n3}': float(z) for n3, z in enumerate(x.split())}, 'node': n2}
            for n2, x in enumerate(i[1:])
        ],
    }
    for n1, i in enumerate(data_string)
]
# Load into a DataFrame and flatten the nested dicts. The index is reset after
# explode() so that the index-based joins below line up row by row.
df = pd.DataFrame(data).explode('data').reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))  # adds the inner 'data' and 'node' columns
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))  # expands the inner 'data' into V columns
Output:
|  | TS0 | TS | grp | node | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TS 0 1.98849600e+08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | TS 0 1.98849600e+08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | TS 0 1.98849600e+08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | TS 0 1.98849600e+08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | TS 0 1.98849600e+08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
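For comparison, the R approach from the question (number the blocks with a running count of "TS 0" header lines, then parse everything else as a table) can be mirrored fairly directly in pandas. The following is only a minimal sketch: the helper name `parse_ts_blocks` is made up for illustration, and it assumes every data line has the same number of whitespace-separated values and that no line before the first header starts with "TS 0".

```python
import io
import pandas as pd

def parse_ts_blocks(lines):
    """Hypothetical helper: build one long DataFrame from TS-delimited lines."""
    s = pd.Series([ln.strip() for ln in lines])
    is_header = s.str.match(r"TS\s+0\s")          # True on each "TS 0 ..." line
    grp = is_header.cumsum()                       # running block number, like cumsum(grepl(...)) in R
    keep = (grp > 0) & ~is_header & (s != "")      # non-empty data lines after the first header
    body = pd.read_csv(io.StringIO("\n".join(s[keep])), sep=r"\s+", header=None)
    body.columns = [f"V{i + 1}" for i in range(body.shape[1])]
    body["grp"] = grp[keep].to_numpy()
    body["node"] = body.groupby("grp").cumcount() + 1
    # Look up each row's header line by its block number
    body["TS0"] = s[is_header].to_numpy()[body["grp"].to_numpy() - 1]
    body["TS"] = body["TS0"].str.split().str[-1].astype(float).round().astype("int64")
    return body

# Two blocks copied from the sample in the question
sample = """TIMEUNITS SECONDS
TS 0 1.98849600e+08
3.30000000e-03 1.25400000e-02 5.88000000e-03 0.00000000e+00 0.00000000e+00
5.88000000e-03 3.33000000e-03 2.16000000e-03 0.00000000e+00 0.00000000e+00
TS 0 1.98853209e+08
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
1.25400000e-02 5.88000000e-03 3.33000000e-03 0.00000000e+00 0.00000000e+00
""".splitlines()

df = parse_ts_blocks(sample)
print(df)
```

For the real file, the same helper could be fed the file's lines, for example the result of `f.readlines()`. Per-TS column operations would then go through `df.groupby("grp")`.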
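Reply from a uj5u.com user:
I wrote this with speed and efficiency in mind: the goal is to avoid unnecessary copies of the file and to avoid holding the whole file in memory at once. For the same reason, it uses the pandas C engine to read the file. The logic ends up more complex than in the other answer.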
#!/usr/bin/env python3
import io
import pandas as pd

def parse_chunk(current_ts_chunk, index, current_ts):
    # Parse the buffered data lines of one TS block with the pandas C engine
    buf = io.StringIO(''.join(current_ts_chunk))
    # sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
    chunk = pd.read_csv(buf, sep=r'\s+', header=None)
    chunk['group'] = index
    chunk['node'] = chunk.index + 1
    chunk['TS'] = current_ts.split()[2]
    chunk['TS0'] = current_ts
    return chunk

with open("D50.bc", "rt") as f:
    # Advance through file until you get to TIMEUNITS SECONDS
    for line in f:
        if line.startswith("TIMEUNITS SECONDS"):
            break
    chunk_list = []
    current_ts = None
    current_ts_chunk = []
    index = 1
    for line in f:
        if line.startswith("TS "):
            if current_ts is None:
                # First TS in file
                current_ts = line.strip()
            else:
                # Write out previous TS
                chunk = parse_chunk(current_ts_chunk, index, current_ts)
                chunk_list.append(chunk)
                # Get ready to accept current TS
                current_ts = line.strip()
                current_ts_chunk = []
                index += 1
        else:
            # This is a data line
            current_ts_chunk.append(line)
    # Write out the final TS block, which has no following header to trigger it
    if current_ts is not None:
        chunk = parse_chunk(current_ts_chunk, index, current_ts)
        chunk_list.append(chunk)

df = pd.concat(chunk_list)
print(df)
This gives the following result:
         0       1        2  ...  node              TS                  TS0
0  0.00000  0.0000  0.00000  ...     1  1.98849600e+08  TS 0 1.98849600e+08
1  0.00000  0.0000  0.00000  ...     2  1.98849600e+08  TS 0 1.98849600e+08
2  0.00000  0.0000  0.00000  ...     3  1.98849600e+08  TS 0 1.98849600e+08
3  0.00000  0.0000  0.00000  ...     4  1.98849600e+08  TS 0 1.98849600e+08
4  0.00000  0.0000  0.00000  ...     5  1.98849600e+08  TS 0 1.98849600e+08
5  0.00000  0.0000  0.00000  ...     6  1.98849600e+08  TS 0 1.98849600e+08
6  0.00000  0.0000  0.00000  ...     7  1.98849600e+08  TS 0 1.98849600e+08
7  0.00000  0.0000  0.00000  ...     8  1.98849600e+08  TS 0 1.98849600e+08
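If the per-TS column operations reduce each block to something small, the same line-by-line scan could hand over one block at a time instead of concatenating every block at the end. The following is a rough sketch of that variation, not part of the answer above: the generator name `iter_ts_frames` is made up for illustration, and it assumes that no line before the first header starts with "TS " and that the last token of each header is the timestamp.

```python
import io
import pandas as pd

def iter_ts_frames(lines):
    """Hypothetical generator: yield (ts_value, DataFrame) for each TS block in turn."""
    def to_frame(rows):
        return pd.read_csv(io.StringIO("".join(rows)), sep=r"\s+", header=None)

    current_ts, rows = None, []
    for line in lines:
        if line.startswith("TS "):
            if current_ts is not None and rows:
                yield current_ts, to_frame(rows)     # emit the block that just ended
            current_ts, rows = float(line.split()[-1]), []
        elif current_ts is not None and line.strip():
            # Data line inside a block; lines before the first TS header are skipped
            rows.append(line if line.endswith("\n") else line + "\n")
    if current_ts is not None and rows:
        yield current_ts, to_frame(rows)             # emit the final block

# Stand-in for the open file: the first three columns of two blocks from the question
sample = io.StringIO(
    "//DATASET\n"
    "TIMEUNITS SECONDS\n"
    "TS 0 1.98849600e+08\n"
    "3.30000000e-03 1.25400000e-02 5.88000000e-03\n"
    "5.88000000e-03 3.33000000e-03 2.16000000e-03\n"
    "TS 0 1.98853209e+08\n"
    "0.00000000e+00 0.00000000e+00 0.00000000e+00\n"
    "1.25400000e-02 5.88000000e-03 3.33000000e-03\n"
)

# Example per-TS operation: the maximum of the first column in each block
summary = {int(ts): frame[0].max() for ts, frame in iter_ts_frames(sample)}
print(summary)
```

With the real file, the open file object from `open("D50.bc")` could be passed in place of `sample`. Only one block's rows are buffered at a time, at the cost of not having a single combined DataFrame afterwards.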
