Formatting a txt file using pandas/Python

I have a txt file from a piece of lab equipment that saves its data in the following format:

Run1 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11
4.61    2.44e-11
5.11    3.66e-11
5.63    3.66e-11
6.14    2.44e-11
6.66    3.66e-11
7.14    3.66e-11
7.67    2.44e-11
8.19    3.66e-11
8.70    2.44e-11
9.20    2.44e-11
9.72    2.44e-11
10.23   2.44e-11
10.73   2.44e-11

Run2 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11
4.61    2.44e-11
5.11    3.66e-11
5.63    3.66e-11
6.14    2.44e-11
6.66    3.66e-11
7.14    3.66e-11
7.67    2.44e-11
8.19    3.66e-11

Run3 
Selected data
        Time (s)    Charge Q (nC)   Charge density q (nC/g) Mass (g)    
Initial -   21.53   -2.81E-01   -1.41E-03   200.0   
Flow    -   0.00    0.00E+00    0.00E+00    0.0 
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11
4.61    2.44e-11
5.11    3.66e-11
5.63    3.66e-11
6.14    2.44e-11
6.66    3.66e-11
7.14    3.66e-11
7.67    2.44e-11
8.19    3.66e-11
8.70    2.44e-11
9.20    2.44e-11

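I have several of these files in my test folder. I would like to simplify and automate the analysis I run on these datasets, since I had similar success with simpler code for another instrument.

What I want to do is extract, from each file, the 2 columns of test data for each of the 3 runs, and export each run to a comma-separated text file named FileName-Run#.txt, where FileName is the name of the input file.

So far I have tried converting the text file contents into a list of lists and then processing the numeric data separately into a new csv, but this has not worked well because I cannot detect the length of the data columns I am interested in.

Several other Q&As here have helped with this, including how to run code over the files in a folder, if it works, that is.

I am using a Jupyter notebook. If it is useful, I can share the code I have written here, though I am embarrassed to show it.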

uj5u.com user reply:

Try this:

import re
from pathlib import Path

input_path = Path("path/to/input_folder")
output_path = Path("path/to/output_folder")
run_name_pattern = re.compile(r"Run\d+")
data_line_pattern = re.compile(r"(\S+)\s+(\S+)")


def write_output(input_file: Path, run_name: str, data: str):
    output_file = output_path / f"{input_file.stem}-{run_name}.csv"
    with output_file.open("w") as fp_out:
        fp_out.write(data)


for input_file in input_path.glob("*.txt"):
    with input_file.open() as fp:
        run_name, data, start_reading = "", "", False

        for line in fp:
            # If a line matches "Run...", start a new run name
            if run_name_pattern.match(line):
                run_name = line.strip()
            # If the line matches "Charge (in Coulomb)...",
            # read in the data, starting with the next line
            elif line.startswith("Charge (in Coulomb) temporal evolution"):
                start_reading = True
            # For the data lines, replace spaces in the middle with a comma
            elif start_reading and line != "\n":
                data += data_line_pattern.sub(r"\1,\2", line)
            # If we encounter a blank line, that means the end of data.
            # Flush the data to disk.
            elif line == "\n":
                write_output(input_file, run_name, data)
                run_name, data, start_reading = "", "", False
        else:
            # If we have reached the end of the file but there still
            # data we haven't written to disk, flush it
            if data:
                write_output(input_file, run_name, data)
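
The loop above is a small state machine: a "Run" line names the run, the "Charge (in Coulomb) temporal evolution" line switches reading on, and a blank line switches it off. As a rough sketch, the same idea can be written as a pure function that works on a string, which makes it easy to check against a pasted sample before touching any files. The names `parse_runs`, `RUN_NAME` and `SECTION_START` are hypothetical and not part of the answer above.

```python
import re

# Hypothetical names, introduced only for this sketch.
RUN_NAME = re.compile(r"Run\d+")  # e.g. "Run1"
SECTION_START = "Charge (in Coulomb) temporal evolution"


def parse_runs(text: str) -> dict[str, list[str]]:
    """Return {run name: ["time,charge", ...]} for every run found in text."""
    runs: dict[str, list[str]] = {}
    run_name, reading = None, False
    for line in text.splitlines():
        stripped = line.strip()
        if RUN_NAME.match(stripped):
            # A new run starts; data is not read until the section title appears.
            run_name, reading = stripped, False
            runs[run_name] = []
        elif stripped.startswith(SECTION_START):
            reading = True
        elif not stripped:
            # A blank line ends the current data block.
            reading = False
        elif reading and run_name is not None:
            # Collapse any run of spaces or tabs into a single comma.
            runs[run_name].append(",".join(stripped.split()))
    return runs


sample = """Run1
Selected data
Initial -   21.53   -2.81E-01   -1.41E-03   200.0
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
4.11    2.44e-11

Run2
Charge (in Coulomb) temporal evolution
3.61    2.44e-11
"""
print(parse_runs(sample))
```

Because the function stops at the blank line, the number of data rows per run never has to be known in advance.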

uj5u.com user reply:

This works:

import csv
import os
import re

# Path to the input file
PATH_INPUT = "./test.txt"
# Define the output directory
DIR_OUTPUT = "./output"



def is_section_start(line):
    """Function to check to see if a line is the start of a section
    
    The start of a section is defined as starting with "Run"
    """
    return re.match("^Run", line)


def is_data_line(line):
    """Function to check to see if a line is a data line
    
    A data line is defined if a line starts with a number
    """
    return re.match(r"^\d", line)


def get_data(line):
    """Split data line into the two numbers"""
    # split() with no argument handles both tabs and runs of spaces
    split = line.split()
    return [float(split[0]), float(split[1])]


if __name__ == "__main__":
    # Open up the input file and read data into a dictionary where the key is the run name and the
    # value is a list of list of the numbers.
    output = {}
    with open(PATH_INPUT) as f_in:
        current_section = None
        for line in f_in.readlines():
            line = line.strip()
            if is_section_start(line) and current_section != line:
                current_section = line
                output[current_section] = []
            
            if is_data_line(line):
                output[current_section].append(get_data(line))
    
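    # Write data, one file per run, named <input file name>-<run>.txt
    os.makedirs(DIR_OUTPUT, exist_ok=True)
    base_name = os.path.splitext(os.path.basename(PATH_INPUT))[0]
    for run, data in output.items():
        with open(os.path.join(DIR_OUTPUT, f"{base_name}-{run}.txt"), "w", newline="") as f_out:
            writer = csv.writer(f_out)
            writer.writerows(data)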
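
The script above handles a single input file, while the question asks for every file in a folder. One possible way to extend it, sketched here with `pathlib`, is to wrap the per-file logic in a function and call it for each `.txt` file. The function names `convert_file` and `convert_folder` are hypothetical, and the sketch assumes, as in the sample, that every data line starts with a digit and holds two whitespace-separated numbers.

```python
import csv
from pathlib import Path


def convert_file(path_input: Path, dir_output: Path) -> list[Path]:
    """Write one CSV per run found in path_input; return the files written."""
    runs: dict[str, list[list[float]]] = {}
    current = None
    for line in path_input.read_text().splitlines():
        line = line.strip()
        if line.startswith("Run"):
            current = line
            runs[current] = []
        elif current is not None and line[:1].isdigit():
            # Assumption: a data line is "<time> <charge>", split on whitespace.
            time_s, charge = line.split()[:2]
            runs[current].append([float(time_s), float(charge)])

    dir_output.mkdir(parents=True, exist_ok=True)
    written = []
    for run, rows in runs.items():
        out = dir_output / f"{path_input.stem}-{run}.txt"
        with out.open("w", newline="") as f_out:
            csv.writer(f_out).writerows(rows)
        written.append(out)
    return written


def convert_folder(dir_input: Path, dir_output: Path) -> list[Path]:
    """Apply convert_file to every .txt file directly inside dir_input."""
    written = []
    # sorted() lists the inputs up front, before any output is written.
    for path in sorted(dir_input.glob("*.txt")):
        written.extend(convert_file(path, dir_output))
    return written
```

A call such as `convert_folder(Path("./data"), Path("./output"))` (both paths are placeholders) would then produce files like `sample-Run1.txt`. Keeping the output folder separate from the input folder avoids re-reading the generated files on a second run.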

uj5u.com user reply:

There are many quite complicated ways to read this data, but I would like to present a simpler one:

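import pandas as pd

with open('file.txt') as f:
    # strip() avoids an empty trailing block when the file ends with blank lines
    data = f.read().strip().split('\n\n')

for run in data:
    run = run.split('\n')
    run_num = run[0]
    # Rows 0-5 are the run name, summary table and section title; data starts at row 6
    df = pd.DataFrame(run[6:])[0].str.split(expand=True).astype(float)
    # The labels split the section title; column 0 looks like time (s), column 1 charge (C)
    df.columns = ['Charge (in Coulomb)', 'temporal evolution']
    print(run_num)
    print(df)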

Output:

Run1
    Charge (in Coulomb)  temporal evolution
0                  3.61        2.440000e-11
1                  4.11        2.440000e-11
2                  4.61        2.440000e-11
3                  5.11        3.660000e-11
4                  5.63        3.660000e-11
5                  6.14        2.440000e-11
6                  6.66        3.660000e-11
7                  7.14        3.660000e-11
8                  7.67        2.440000e-11
9                  8.19        3.660000e-11
10                 8.70        2.440000e-11
11                 9.20        2.440000e-11
12                 9.72        2.440000e-11
13                10.23        2.440000e-11
14                10.73        2.440000e-11
Run2
   Charge (in Coulomb)  temporal evolution
0                 3.61        2.440000e-11
1                 4.11        2.440000e-11
2                 4.61        2.440000e-11
3                 5.11        3.660000e-11
4                 5.63        3.660000e-11
5                 6.14        2.440000e-11
6                 6.66        3.660000e-11
7                 7.14        3.660000e-11
8                 7.67        2.440000e-11
9                 8.19        3.660000e-11
Run3
    Charge (in Coulomb)  temporal evolution
0                  3.61        2.440000e-11
1                  4.11        2.440000e-11
2                  4.61        2.440000e-11
3                  5.11        3.660000e-11
4                  5.63        3.660000e-11
5                  6.14        2.440000e-11
6                  6.66        3.660000e-11
7                  7.14        3.660000e-11
8                  7.67        2.440000e-11
9                  8.19        3.660000e-11
10                 8.70        2.440000e-11
11                 9.20        2.440000e-11
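
The snippet above only prints each run. To get the per-run files the question asks for, each DataFrame could be passed to `DataFrame.to_csv`. The following is a sketch under a few assumptions: the helper names `run_frames` and `export_runs` and the column names `time_s` and `charge_C` are made up for illustration, and data rows are picked by "starts with a digit" rather than by a fixed row offset.

```python
from pathlib import Path

import pandas as pd


def run_frames(text: str) -> dict[str, pd.DataFrame]:
    """Map each run name to a two-column DataFrame of its numeric rows."""
    frames = {}
    for block in text.strip().split("\n\n"):
        lines = block.strip().split("\n")
        run_num = lines[0].strip()
        # Keep only rows that start with a digit, so the number of
        # header lines does not need to be known in advance.
        rows = [ln.split() for ln in lines[1:] if ln[:1].isdigit()]
        # Hypothetical column names; the file itself does not label them.
        frames[run_num] = pd.DataFrame(rows, columns=["time_s", "charge_C"]).astype(float)
    return frames


def export_runs(text: str, stem: str, out_dir: Path) -> list[Path]:
    """Write <stem>-<run>.txt for every run in text; return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for run, df in run_frames(text).items():
        path = out_dir / f"{stem}-{run}.txt"
        # header=False keeps just the two data columns; set it to True to label them.
        df.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths
```

Here `stem` would normally be the input file name without its extension, for example `Path("file.txt").stem`.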

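uj5u.com user reply:

This approach reads the text file into a single DataFrame column, with one line of the file per row. From there, each run number and its results are extracted.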

import os
import pandas as pd

txt_file_path = r'D:\jchtempnew\SO\ResultFile.txt'
df = pd.read_fwf(txt_file_path, widths=[999999], header=None)

df_out = \
    df[0].str.extract(r'^(\d+.*?)\s+(\d+.*)').astype(float) \
    .assign(Run=df[0].str.extract(r'^Run(\d+)', expand=False).ffill()).dropna() \
    .rename(columns={0:'Charge (in Coulomb)',1:'temporal evolution'})

for run in df_out['Run'].unique():   
    df_out[df_out['Run']==run] \
        .to_csv(f'{os.path.splitext(txt_file_path)[0]}-Run{run}.csv', index=False)
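
To see what the extract-and-forward-fill step does, it may help to run it on a few lines held in memory. The sketch below is illustrative only: the sample lines are shortened from the question, the column names `time_s` and `charge_C` are assumptions, and `groupby` with `drop` is swapped in for the boolean filter so that each per-run frame keeps just the two data columns.

```python
import pandas as pd

# A shortened, in-memory stand-in for the lines of the input file.
lines = pd.Series([
    "Run1",
    "Selected data",
    "Initial -   21.53   -2.81E-01   -1.41E-03   200.0",
    "Charge (in Coulomb) temporal evolution",
    "3.61    2.44e-11",
    "4.11    2.44e-11",
    "Run2",
    "Charge (in Coulomb) temporal evolution",
    "5.11    3.66e-11",
])

# Two capture groups give two columns; lines that do not match become NaN.
values = lines.str.extract(r"^(\d\S*)\s+(\S+)$").astype(float)
values.columns = ["time_s", "charge_C"]  # hypothetical column names

# One capture group with expand=False gives a Series; ffill carries the
# most recent run number down to the data rows that follow it.
run = lines.str.extract(r"^Run(\d+)", expand=False).ffill()

# dropna removes every row that was not a data line.
tidy = values.assign(Run=run).dropna()

# groupby yields one sub-frame per run, ready to be passed to to_csv.
per_run = {
    name: grp.drop(columns="Run").reset_index(drop=True)
    for name, grp in tidy.groupby("Run")
}
print(per_run["1"])
```

Each value in `per_run` can then be written with `to_csv(..., index=False)` using the same file name pattern as above.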

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/486677.html

Tags: Python pandas CSV text
