Parsing an ASCII table header

I need to parse the following into a DataFrame or a list:

tmp = [
    '+--------------+-----------------------------------------+',
    '| Something to |        Some header with subheader       |',
    '|  watch or    +-----------------+-----------------------+',
    '|     idk      |      First      |   another text again  |',
    '|              |                 |  with one more line   |',
    '|              |                 +-----------------------+',
    '|              |                 |  and this | how it be |',
    '+--------------+-----------------+-----------------------+',
]

It is just a plain-text table with an odd, nested header. I need to convert it to:

['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']

Here is my first attempt that got me closer to a solution (you can see what I tried in the comments):

import re

# indices of the border lines, i.e. the lines that start with '+'
pluses = [i for i, element in enumerate(tmp) if element[0] == '+']
tmp2 = tmp[pluses[0]:pluses[1] + 1].copy()
table_str = ''.join(tmp[pluses[0]:pluses[1] + 1])
# positions of every '+' and '|' on each line
col = [[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]

tmp3 = []
strt = ''.join(tmp2.copy())
# split the joined text on horizontal borders such as '+-----+-----+'
table_list = [l.strip().replace('\n', '') for l in re.split(r'\+[+-]+', strt) if l.strip()]
for row in table_list:
    joined_row = ['' for _ in range(len(row))]
    # '||' marks the boundary between two consecutive text lines
    for lines in [line for line in row.split('||')]:
        line_part = [i.strip() for i in lines.split('|') if i]
        joined_row = [i + j for i, j in zip(joined_row, line_part)]
        tmp3.append(joined_row)

This produces:

tmp3
out[4]:
[['Something to', 'Some header with subheader'],
 ['Something towatch or'],
 ['idk', 'First', 'another text again'],
 ['idk', 'First', 'another text againwith one more line'],
 ['idk'],
 ['', '', 'and this', 'how it be']]

All that remains is to join these pieces in the right way, but I don't know how...

An addition: the positions of the pluses and separators can be found like this:

col = [[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
[[0, 15, 57],
 [0, 15, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 45, 57],
 [0, 15, 33, 57]]

Then we could split or group by cell, but I don't know how to do that either... Please help.
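
As a minimal sketch of the "split by cell" idea (this is not the asker's code; the helper names separator_positions and split_cells are made up for illustration), the separator indices of a line can be used to slice out the text between neighbouring separators:

```python
def separator_positions(line):
    """Return the indices of '+' and '|' characters in one table line."""
    return [i for i, ch in enumerate(line) if ch in '+|']


def split_cells(line):
    """Slice the text between consecutive separators and strip the padding."""
    pos = separator_positions(line)
    return [line[a + 1:b].strip() for a, b in zip(pos, pos[1:])]


# One content line from the first table.
line = '|              |                 |  and this | how it be |'
print(split_cells(line))  # ['', '', 'and this', 'how it be']
```

This only covers splitting a single line. Border segments come back as runs of dashes, and deciding which cells belong under which header still has to be done separately.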

Example 2:

+----------+------------------------------------------------------------+---------------+----------------------------------+--------------------+-----------------------+
|   Number |       longtextveryveryloooooong                            |  aaaaaaaaaaa  |         bbbbbbbbbbbbbbbbbb       |    dfsdfgsdfddd    |qqqqqqqqqqqqqqqqqqqqqq |
| string   |                                                            |               |        ccccccccccccccccccccc     |    affasdd  as     |qqqqqqqqqqqqqqqqqqqqqq |
|          |                                                            |               | eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee,|    seeerrrr   e,   |   dfsdfffffffffffff   |
|          |                                                            |               | anothertext and something        |       percent      |   ttttttttttttttttt   |
|          |                                                            |               |           (nothingtodo),         |                    | sssssssssssssssssssss |
|          |                                                            |               |             and text             |                    |zzzzzzzzzzzzzzzzzzzzzz |
|          |                                                            |               +----------------------------------+                    | b rererereerr ppppppp |
|          |                                                            |               |     all         | longtext wit-  |                    |                       |
|          |                                                            |               |                 |h many character|                    |                       |
+----------+------------------------------------------------------------+---------------+-----------------+----------------+--------------------+-----------------------+

Reply from a uj5u.com user:

You can do this recursively, parsing one "sub-table" at a time:

def parse_table(table, header='', root='', table_len=None):
    # store length of original table
    if not table_len:
        table_len = len(table)

    # end of current "column"
    col = table[0].find('+', 1)
    rows = [
        row for row in range(1, len(table))
            if table[row].startswith('+')
            and table[row][col] == '+'
    ]
    row = rows[0]

    # split "line" contents into columns
    # end of "line" is either `+` or final `|`
    end = col
    num_cols = table[0].count('+')
    if num_cols != table[1].count('|'):
        end = table[1].rfind('|')
    columns = (line[1:end].split('|') for line in table[1:row])

    # rebuild each column appending to header
    content = [
        ' '.join([header] + [line.strip() for line in lines]).strip()
        for lines in zip(*columns)
    ]

    # is there a table below?
    if row + 2 < len(table):
        header = content[-1]
        # if we are not the last table - we are a header
        if len(rows) > 1:
            header = content.pop()
        # if we are the first table in column - we are the root
        if not root:
            root = header
        next_table = [line[:col + 1] for line in table[row:]]
        content.extend(
            parse_table(
                next_table,
                header=header,
                root=root,
                table_len=table_len
            )
        )

    # is there a table to the right?
    if col + 2 < len(table[0]):
        # find start line of next table
        row = next(
            row for row, line in enumerate(table, start=-1)
                if line[col] == '|'
        )
        next_table = [line[col:] for line in table[row:]]
        # new top-level table - reset root
        if len(next_table) == table_len:
            root = ''
        # next table on same level - reset header 
        if len(table) == len(next_table):
            header = root
        content.extend(
            parse_table(
                next_table,
                header=header,
                root=root,
                table_len=table_len
            )
        )

    return content

Output:

>>> parse_table(table)
['Something to watch or idk',
 'Some header with subheader First',
 'Some header with subheader another text again with one more line and this',
 'Some header with subheader another text again with one more line how it be']
>>> parse_table(big_table)
['Number string',
 'longtextveryveryloooooong',
 'aaaaaaaaaaa',
 'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text all',
 'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text longtext wit- h many character',
 'dfsdfgsdfddd affasdd  as seeerrrr   e, percent',
 'qqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqq dfsdfffffffffffff ttttttttttttttttt sssssssssssssssssssss zzzzzzzzzzzzzzzzzzzzzz b rererereerr ppppppp']
>>> parse_table(planets)
['Planets Planet Sun (Solar) Earth Moon Mars',
 'Planets R (km) 696000 6371 1737 3390',
 'Planets mass (x 10^29 kg) 1989100000 5973.6 73.5 641.85']

Reply from a uj5u.com user:

Since the input is in the format of a reStructuredText grid table, you can use the docutils table parser.

import docutils.parsers.rst.tableparser
import docutils.statemachine
from collections.abc import Iterable

def extract_texts(tds):
    """Recursively extract StringLists and join their lines."""
    texts = []
    for e in tds:
        if isinstance(e, docutils.statemachine.StringList):
            # a cell's content: join its non-empty lines into one string
            texts.append(' '.join([s.strip() for s in list(e) if s]))
            break
        if isinstance(e, Iterable):
            texts.append(extract_texts(e))
    return texts

>>> parser = docutils.parsers.rst.tableparser.GridTableParser()
>>> tds = parser.parse(docutils.statemachine.StringList(tmp))
>>> extract_texts(tds)

[[],
 [],
 [[['Something to watch or idk'], ['Some header with subheader']],
  [['First'], ['another text again with one more line']],
  [['and this | how it be']]]]

Then flatten the result.
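
One way to do the flattening, as a small sketch (the helper name flatten is hypothetical and not part of docutils; it assumes the nesting consists only of lists and strings, as in the output shown above):

```python
def flatten(nested):
    """Yield the strings from arbitrarily nested lists, depth-first."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


# The nested structure printed by extract_texts(tds) above.
texts = [[],
         [],
         [[['Something to watch or idk'], ['Some header with subheader']],
          [['First'], ['another text again with one more line']],
          [['and this | how it be']]]]

print(list(flatten(texts)))
```

This gives the cell texts in reading order. It does not prepend parent headers to their sub-headers, and the last cell keeps its inner | because the parser output above treats it as ordinary cell text.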

For more general use, it is worth inspecting tds (the structure returned by parse); the docutils documentation has some notes on it.
