How do I remove emojis from a pandas DataFrame column? It works on a single string but not on the whole column

I am trying to remove emojis from a column in a pandas DataFrame, using this code:

import re
import unicodedata

def remove_emoji(string):
    # Character class of common emoji ranges; "+" removes a whole run per match.
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

def decontracted(phrase):
    # specific
    phrase = phrase.rstrip()
    phrase = ' '.join(phrase.split())
    # remove URLs
    phrase = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', phrase)
    # remove @mentions
    phrase = re.sub(r'@[\w]+', '', phrase)
    # remove every non-ASCII character
    phrase = re.sub(r'[^\x00-\x7f]', r'', phrase)
    # general
    phrase = re.sub(r'@[^\s]+', '', phrase)
    phrase = remove_accented_chars(phrase)
    phrase = remove_special_characters(phrase)
    phrase = remove_emoji(phrase)
    return phrase

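def remove_accented_chars(text):
    # Decompose accented letters (NFKD), then drop the non-ASCII combining marks.
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return new_text

def remove_special_characters(text):
    # Define the pattern to keep: letters, digits, basic punctuation, whitespace.
    # The range is A-Z, not A-z; A-z would also keep [ \ ] ^ _ and the backtick.
    pat = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    return re.sub(pat, '', text)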

I apply it to the DataFrame column like this:

AAVE["sentence"] = AAVE["sentence"].apply(decontracted)

['He better hurry amp; come back from playing cards', 'I ordered a new phone', 'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d', 'imma cry']

Above is a sample of the text I am testing on. The \ud83d\ude18\u2764\ud83d\ude0d is not removed.

- - - - - - - EDIT - - - - - - -

This is the code I use to load the data from the TSV file:

AAVE = pd.read_csv('twitteraae_all_aa', sep='\t', on_bad_lines='skip')
columns = ['ID', 'Date', 'Num', 'Location','Num2', 'AA', 'Hispanic', 'Other', 'White']
AAVE.drop(columns, inplace=True, axis=1)
AAVE = AAVE.rename(columns={'Sentence': 'sentence'})
AAVE['label'] = 1

AAVE['sentence'] = AAVE['sentence'][0:391165].astype('string')
AAVE = AAVE.dropna()
AAVE['sentence1'] = AAVE['sentence'].astype('string').apply(decontracted).astype('string')

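If I create an array of strings and apply the decontracted function to it, the code works. If I apply it to the DataFrame, everything else I want removed is removed, but the emojis are not.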
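
One thing worth checking here, offered as an assumption rather than a confirmed diagnosis: the file may store each emoji as literal escape text (a backslash, a `u`, and four hex digits, all plain ASCII) rather than as real emoji code points. The emoji pattern would have nothing to match in that case. A minimal sketch for telling the two apart; the `inspect_text` helper and the sample strings are hypothetical, not part of the question's code:

```python
import re

# A backslash followed by "u" and four hex digits, written out as plain ASCII text.
LITERAL_ESCAPE = re.compile(r'\\u[0-9a-fA-F]{4}')

def inspect_text(text):
    """Summarize how a string stores its emoji (hypothetical diagnostic helper)."""
    return {
        "length": len(text),
        "literal_escapes": len(LITERAL_ESCAPE.findall(text)),
        "non_ascii_chars": sum(1 for ch in text if ord(ch) > 127),
    }

real_emoji = "baby \U0001F618"        # one actual emoji code point
escaped_text = r"baby \ud83d\ude18"   # twelve ASCII characters spelling out two escapes

print(inspect_text(real_emoji))
print(inspect_text(escaped_text))
```

Running the same helper over a few rows, for example with `AAVE["sentence"].head().map(inspect_text)`, should show which of the two cases the column is in.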

Reply from a uj5u.com user:

You have to apply it row by row:

AAVE["sentence"] = AAVE.apply(lambda row: remove_emoji(row["sentence"]), axis=1)
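
For comparison, `DataFrame.apply(..., axis=1)` and `Series.apply` should produce the same values when only one column is involved, so switching to row-wise application may not change the outcome by itself. A small sketch on made-up data that contains real emoji code points; the pattern is a shortened stand-in for the one in the question:

```python
import re
import pandas as pd

# Shortened emoji pattern, for illustration only.
EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF\u2702-\u27B0]+")

def remove_emoji(text):
    return EMOJI_RE.sub("", text)

df = pd.DataFrame({"sentence": ["lol okay baby \U0001F618\u2764\U0001F60D", "imma cry"]})

# Three ways to apply the same cleaning to one column.
row_wise = df.apply(lambda row: remove_emoji(row["sentence"]), axis=1)
element_wise = df["sentence"].apply(remove_emoji)
vectorized = df["sentence"].str.replace(EMOJI_RE, "", regex=True)

print(row_wise.tolist() == element_wise.tolist() == vectorized.tolist())
```

The `str.replace` form avoids a Python-level lambda and is usually the more idiomatic choice for a single text column.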

Reply from a uj5u.com user:

This line of code removes emojis column by column:

df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

Note that it also removes every non-English letter and special character; the code can be adjusted if you need to keep them.
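
To make that trade-off concrete, here is a small sketch on made-up data. It contrasts the ASCII round trip with a regex that targets only an emoji range; the range used is an illustrative assumption, not a complete emoji definition:

```python
import re
import pandas as pd

s = pd.Series(["caf\u00e9 \U0001F618", "na\u00efve text"])

# ASCII round trip: drops emoji, but also drops accented letters.
ascii_only = s.str.encode("ascii", "ignore").str.decode("ascii")

# Regex on an emoji range: drops the emoji and keeps accented letters.
EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF]+")
emoji_only = s.str.replace(EMOJI_RE, "", regex=True)

print(ascii_only.tolist())
print(emoji_only.tolist())
```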

Reply from a uj5u.com user:

Your function works for me:

arr = ['He better hurry amp; come back from playing cards', 'I ordered a new phone',
         'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d', 'imma cry']
df = pd.DataFrame({"column1": [0, 1, 2, 3], "column2": arr})

df
   column1                                            column2
0        0  He better hurry amp; come back from playing cards
1        1                              I ordered a new phone
2        2                                lol okay baby \ud83d\ude18?\ud83d\ude0d
3        3                                           imma cry

df["column2"] = df["column2"].apply(decontracted)
df
   column1                                            column2
0        0  He better hurry amp; come back from playing cards
1        1                              I ordered a new phone
2        2                                     lol okay baby 
3        3                                           imma cry

Could the way the text is stored in the DataFrame be the issue?
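
If storage is indeed the issue, one plausible case (an assumption, since the file contents are not shown) is that the TSV holds the escapes as literal text such as a backslash followed by `ud83d`. The sketch below removes such sequences directly; `remove_literal_escapes` is a hypothetical helper, not part of the original code:

```python
import re

# Literal \uXXXX or \UXXXXXXXX sequences written out as ASCII text.
LITERAL_ESCAPE = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

def remove_literal_escapes(text):
    """Delete escape sequences that were stored as text instead of characters."""
    return LITERAL_ESCAPE.sub('', text)

sample = r"lol okay baby \ud83d\ude18\u2764\ud83d\ude0d"  # raw string: backslashes are literal
cleaned = remove_literal_escapes(sample)
print(repr(cleaned))
```

On a DataFrame, the same pattern could be passed to `AAVE["sentence"].str.replace(LITERAL_ESCAPE, "", regex=True)`. This step would need to run before any cleaning step that strips backslashes, otherwise fragments like `ud83d` are left behind.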

