Finding unknown non-English characters in a text file (Python)

Suppose we load a text file like this:

with open('my_file.txt', mode='r') as file:
    stg = file.read()

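This file contains some unknown non-English characters. They may take different forms, such as á, ?, ?, etc. How can I extract these characters together with their location in the text file? The output should be a list of these characters and their positions (line numbers).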

Reply from a uj5u.com user:

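Assuming you want to find every character that is not an English letter, a digit, a punctuation mark, or a backslash, you can use the following code to collect the position and value of each match. Note that the position is a character offset into the whole string, not a line number, and that whitespace (including newlines) is also reported: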

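import re
import string
# re.escape keeps ], ^, - and \ in string.punctuation from breaking the character class
[(match.start(), match.group()) for match in re.finditer(f'[^a-zA-Z0-9{re.escape(string.punctuation)}]', stg)]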

Example input:

ábxcsdas??????????????adasda/.1.32131.!#@%$%&*^()|\}}"?>:{}?><<"

It returns:

[(0, 'á'), (8, '?'), (9, '?'), (10, '?'), (11, '?'), (12, '?'), (13, '?'), (14, '?'), (15, '?'), (16, '?'), (17, '?'), (18, '?'), (19, '?'), (20, '?'), (21, '?')]
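Since the question asks for line numbers rather than character offsets, the offsets from `re.finditer` can be converted afterwards. The following is a minimal sketch, not part of the original answer: `find_with_line_numbers` is a hypothetical helper, and the pattern additionally excludes spaces, tabs, and line breaks (an assumption, so that newlines are not reported as matches).

```python
import re
import string

# Same idea as the pattern above; punctuation is escaped for use inside [...],
# and space, tab, CR and LF are excluded as well (an assumption of this sketch).
NON_ENGLISH = re.compile("[^a-zA-Z0-9" + re.escape(string.punctuation) + r" \t\r\n]")

def find_with_line_numbers(text):
    """Return (line, column, char) tuples, both 1-based (hypothetical helper)."""
    results = []
    for match in NON_ENGLISH.finditer(text):
        offset = match.start()
        # Number of newlines before the match gives the 0-based line index
        line = text.count("\n", 0, offset) + 1
        # rfind returns -1 when no earlier newline exists, so line 1 also works
        column = offset - text.rfind("\n", 0, offset)
        results.append((line, column, match.group()))
    return results

print(find_with_line_numbers("abc\ndéf ü\nxyz"))
```

Counting newlines for every match rescans the text each time, which is fine for small files; for large files, iterating line by line is likely the better choice.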

Reply from a uj5u.com user:

This is code I used in one of my projects. It does not flag punctuation or special characters, because those are ASCII.

def isEnglishChar(s):
    # A character counts as English here if it is plain ASCII
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

with open('test.txt', mode='r') as file:
    lines = file.readlines()

for index, value in enumerate(lines):
    for ch in value:
        if not isEnglishChar(ch):
            # enumerate starts at 0, so add 1 for a 1-based line number
            print(ch, index + 1)
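On Python 3.7 or later, the encode/decode round trip can probably be replaced by the built-in `str.isascii()` method. A minimal sketch under that assumption follows; `non_ascii_chars` is a hypothetical name, and the list of strings stands in for the lines read from the file.

```python
def non_ascii_chars(lines):
    """Yield (char, line_number) for every non-ASCII character (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        for ch in line:
            if not ch.isascii():  # str.isascii() requires Python 3.7+
                yield ch, line_number

# A list of strings stands in for file.readlines()
print(list(non_ascii_chars(["plain text\n", "café\n"])))
```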

Reply from a uj5u.com user:

ASCII characters have Unicode code points between 0 and 127. Any character whose code point is greater than 127 is not ASCII.

with open(filename) as fp:
    for lineno, line in enumerate(fp, start=1):
        for ch in line:
            if ord(ch) > 127:
                print(lineno, ch)
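To make the 127 threshold concrete, here is a small illustration of what `ord` returns for a few characters. The helper name `is_non_ascii` is hypothetical and only wraps the comparison used above.

```python
def is_non_ascii(ch):
    """True when the character's code point lies outside the ASCII range 0..127."""
    return ord(ch) > 127

# Letters and '~' fall inside the ASCII range; 'á' and '€' do not
for ch in "aZ~á€":
    print(repr(ch), ord(ch), is_non_ascii(ch))
```

Punctuation such as `~` has a code point below 128, which is why this check, unlike a letters-and-digits pattern, never reports ordinary punctuation.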

Reply from a uj5u.com user:

with open("testfile.txt", 'w') as f_out:
    test_text= '''
    This file contains some non-English unknown characters. 
    These characters may have different forms like á, 
    ?, ?, etc. How can I extract these characters with their location in the text file
    '''
    f_out.write(test_text)
with open("testfile.txt") as fp:
    for lineno, line in enumerate(fp, start=1):
        ch_count = 0
        for ch in line:
            ch_count += 1
            if ord(ch) > 127:
                print(f'{lineno=}\tCharacter Number={ch_count}\t {ch=}')

Output:

lineno=3    Character Number=52  ch='á'
lineno=4    Character Number=5   ch='?'
lineno=4    Character Number=8   ch='?'
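Because the characters are described as unknown, it may also help to print what each one is once it has been located. The sketch below uses the standard library `unicodedata` module; `describe` is a hypothetical helper and the sample characters are chosen only for illustration.

```python
import unicodedata

def describe(ch):
    """Return a 'U+XXXX NAME' label for one character (hypothetical helper)."""
    # Some code points (for example control characters) have no name,
    # so a fallback string is passed as the default
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"

for ch in "áß":
    print(repr(ch), describe(ch))
```

Printing the code point alongside the character is useful when the terminal or editor cannot render the character itself.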

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/433670.html
