Removing meaningless characters from a pandas DataFrame

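I am trying to remove all sequences of the form

\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n\, \xe2\x80\xa6,\xe2\x80\x99t

from the strings below, which are stored in a Python pandas column. Although the text starts with b', each value is a str, not a bytes object.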

    Text
  _____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6

"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '

"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"

"b'Climate Change Poses a WidelllThreat to National Security "

"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"

"b'berates climate change activist who confronted her in airport\xc2\xa0

The text above is a single column in a pandas DataFrame.

I tried

string.encode('ascii', errors='ignore')

and regular expressions, but had no luck. Any suggestions would be very helpful.

A uj5u.com user replied:

Your strings look like byte strings but are not, which is why encode/decode has no effect. Try something like this:

>>> df['text'].str.replace(r'\\x[0-9a-f]{2}', '', regex=True)

0    b'Hello!  End Climate Silence is looking for v...
1    b'I doubt if climate emergency 8s real, I thin...
2    b'No, thankfully it doesnt. Cant see how cheap...
3    b'Climate Change Poses a WidelllThreat to Nati...
4    b""This doesn't feel like targeted propaganda ...
5    b'berates climate change activist who confront...
Name: text, dtype: object

Note that you still have to clean up the unbalanced single/double quotes and remove the leading 'b' character.
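
The remaining cleanup could be sketched in plain Python as follows. This is only a sketch, not the answerer's code: the helper name clean_text and the sample string are hypothetical, and it assumes each cell is a str containing literal backslash sequences such as \xf0 and \n rather than real non-ASCII characters or real newlines.

```python
import re

# Literal "\xNN" escapes: a backslash, then "x", then two hex digits.
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")

def clean_text(s: str) -> str:
    """Strip bytes-repr debris from one cell (hypothetical helper)."""
    s = HEX_ESCAPE.sub("", s)              # drop literal \xNN sequences
    s = s.replace("\\n", " ")              # literal backslash-n, not a real newline
    s = re.sub(r"""^"?b['"]+""", "", s)    # leading b' or b" (optionally preceded by ")
    s = re.sub(r"""['"\s]+$""", "", s)     # trailing quotes and whitespace
    return re.sub(r"\s{2,}", " ", s)       # collapse runs of whitespace

print(clean_text(r"b'Hello! \xf0\x9f\x93\xa2 End \n\n1-2 hours'"))
```

On a DataFrame, the same function could be applied per cell with df['text'].map(clean_text).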

A uj5u.com user replied:

You can iterate over the string and keep only the ASCII characters:

my_str = "b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6"
new_str = "".join(c for c in my_str if c.isascii())
print(new_str)
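
One caveat, shown as a small sketch: in a normal Python string literal, \xf0 is interpreted as a single non-ASCII character, so the filter above removes it. If the column instead holds a literal backslash followed by x, f, 0 (as the question suggests), every character is already ASCII and the filter changes nothing. The helper name keep_ascii and both sample strings are hypothetical; str.isascii() requires Python 3.7 or later.

```python
def keep_ascii(s: str) -> str:
    """Keep only ASCII characters (same filter as above, wrapped in a function)."""
    return "".join(c for c in s if c.isascii())

# Normal literal: Python turns each \xNN into one character (U+00F0, U+009F, ...).
interpreted = "Hello! \xf0\x9f\x93\xa2 End"
# Raw literal: the backslashes are kept, as they would be in text read from a file.
literal = r"Hello! \xf0\x9f\x93\xa2 End"

print(len(interpreted), len(literal))
print(repr(keep_ascii(interpreted)))   # the four non-ASCII characters are dropped
print(keep_ascii(literal) == literal)  # nothing is removed from the raw version
```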

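Note that .encode('ascii', errors='ignore') does not change the string it is applied to; it returns the encoded result as a bytes object. This should work:

# encode() returns bytes, so decode back to str before printing
new_str = my_str.encode('ascii', errors='ignore').decode('ascii')
print(new_str)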
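
If the column holds literal backslash escapes, another option is to turn them back into the characters they encode before deciding what to drop. The sketch below is not from either answer: the helper name unescape_bytes_repr is hypothetical, and it assumes the escapes are UTF-8 bytes written out as \xNN. Truncated byte sequences are silently skipped, and a string ending in a lone backslash would raise an error.

```python
def unescape_bytes_repr(s: str) -> str:
    """Turn literal \\xNN escapes back into real characters (hypothetical helper)."""
    # Step 1: interpret backslash escapes; each \xNN becomes the character U+00NN.
    interpreted = s.encode("ascii", errors="ignore").decode("unicode_escape")
    # Step 2: map those characters back to raw bytes, then decode the bytes as UTF-8.
    return interpreted.encode("latin-1", errors="ignore").decode("utf-8", errors="ignore")

text = unescape_bytes_repr(r"it doesn\xe2\x80\x99t")
print(text)  # contains a real right single quotation mark (U+2019)

# With real characters in place, the ASCII-only encode now has something to drop.
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)
```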


Tags: python-3.x, regex, pandas, string, regex-group
