I am trying to remove emojis from a column in a pandas DataFrame, using this code:
import re
import unicodedata

def remove_emoji(string):
    # Character class of emoji ranges; "+" matches runs of consecutive emojis.
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very broad range; also covers CJK
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
def decontracted(phrase):
    # specific
    phrase = phrase.rstrip()
    phrase = ' '.join(phrase.split())
    # strip URLs
    phrase = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', phrase)
    # strip @mentions
    phrase = re.sub(r'@[\w]+', '', phrase)
    # strip every non-ASCII character
    phrase = re.sub(r'[^\x00-\x7f]', r'', phrase)
    # general
    phrase = re.sub(r'@[^\s]+', '', phrase)
    phrase = remove_accented_chars(phrase)
    phrase = remove_special_characters(phrase)
    phrase = remove_emoji(phrase)
    return phrase
def remove_accented_chars(text):
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return new_text
def remove_special_characters(text):
    # define the pattern to keep
    # (A-Z, not A-z: the range A-z would also keep [ \ ] ^ _ and the backtick)
    pat = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    return re.sub(pat, '', text)
I apply it to the DataFrame column like this:
AAVE["sentence"] = AAVE["sentence"].apply(decontracted)
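['He better hurry amp; come back from playing cards', 'I ordered a new phone', 'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d', 'imma cry']
The above is a sample of the text I am testing with. The \ud83d\ude18\u2764\ud83d\ude0d is not removed.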
------- EDIT -------
This is the code I use to load the data from the TSV file:
AAVE = pd.read_csv('twitteraae_all_aa', sep='\t', on_bad_lines='skip')
columns = ['ID', 'Date', 'Num', 'Location','Num2', 'AA', 'Hispanic', 'Other', 'White']
AAVE.drop(columns, inplace=True, axis=1)
AAVE = AAVE.rename(columns={'Sentence': 'sentence'})
AAVE['label'] = 1
AAVE['sentence'] = AAVE['sentence'][0:391165].astype('string')
AAVE = AAVE.dropna()
AAVE['sentence1'] = AAVE['sentence'].astype('string').apply(decontracted).astype('string')
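If I create an array of strings and apply the decontracted function to it, the code works. But when I apply it to the DataFrame, everything else I want removed is removed and the emojis are not.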
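One possible explanation for that difference can be checked without pandas. The sketch below assumes (the question does not confirm this) that the TSV file stores each emoji as literal backslash escape text rather than as real characters. The names `from_literal` and `from_file` are hypothetical and only simulate the two cases:

```python
import re

# In a Python string literal, each \uXXXX escape becomes ONE code point
# (here two lone surrogates), so the string is 5 + 2 = 7 characters long.
from_literal = 'baby \ud83d\ude18'

# Text read from a file may instead hold the backslash, the "u" and the hex
# digits as separate ASCII characters. The raw string simulates that case.
from_file = r'baby \ud83d\ude18'

# Same pattern that decontracted uses to drop non-ASCII characters.
non_ascii = re.compile(r'[^\x00-\x7f]')

print(len(from_literal), len(from_file))
print(repr(non_ascii.sub('', from_literal)))  # surrogates are non-ASCII, so they go
print(repr(non_ascii.sub('', from_file)))     # plain ASCII text, so nothing matches
```

On the real data, printing `repr(AAVE['sentence'].iloc[i])` for a row `i` known to contain an emoji should show which case applies: a doubled backslash (`\\ud83d`) in the repr would mean the escape is stored as ordinary text.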
Reply from a uj5u.com user:
You have to apply it row by row:
AAVE["sentence"] = AAVE.apply(lambda row: remove_emoji(row["sentence"]), axis=1)
Reply from a uj5u.com user:
This line of code removes emojis column by column:
df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
Note that it also removes all non-English letters and special characters, but the code can be edited if you need to keep them.
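If non-English letters do need to be kept, one option is to filter by Unicode category instead of encoding to ASCII. This is a rough sketch and not part of the answer's code: `strip_symbols` and `EMOJI_JOINERS` are hypothetical names, and the category test covers most emojis but not all of them (keycap sequences, for example, are not handled). It also drops other "Symbol, other" characters such as the copyright and degree signs.

```python
import unicodedata

# Zero-width joiner and variation selector-16, which glue emoji sequences together.
EMOJI_JOINERS = {'\u200d', '\ufe0f'}


def strip_symbols(text):
    """Drop emoji-like code points but keep letters from any script."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # 'So' = Symbol, other (most emojis); 'Cs' = lone surrogates.
        if category in ('So', 'Cs'):
            continue
        # Joiners and the five skin-tone modifiers U+1F3FB..U+1F3FF.
        if ch in EMOJI_JOINERS or '\U0001F3FB' <= ch <= '\U0001F3FF':
            continue
        kept.append(ch)
    return ''.join(kept)
```

It could then be applied per column in the same way, for example `df["column2"].apply(strip_symbols)`.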
Reply from a uj5u.com user:
Your function works for me:
arr = ['He better hurry amp; come back from playing cards', 'I ordered a new phone',
'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d', 'imma cry']
df = pd.DataFrame({"column1": [0, 1, 2, 3], "column2": arr})
df
column1 column2
0 0 He better hurry amp; come back from playing cards
1 1 I ordered a new phone
2 2 lol okay baby \ud83d\ude18?\ud83d\ude0d
3 3 imma cry
df["column2"] = df["column2"].apply(decontracted)
df
column1 column2
0 0 He better hurry amp; come back from playing cards
1 1 I ordered a new phone
2 2 lol okay baby
3 3 imma cry
Could the problem be the way the text is stored in the DataFrame?
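If the stored text does turn out to contain the escapes as literal characters (a backslash, a "u" and four hex digits), none of the emoji patterns would match, because every character is plain ASCII. A minimal sketch of a workaround under that assumption follows. The names `ESCAPED_CODE_UNIT` and `strip_escaped_unicode` are hypothetical, and the pattern removes every literal escape of that shape, not only emojis:

```python
import re

# A literal backslash, the letter "u", then exactly four hex digits,
# all stored as ordinary ASCII text.
ESCAPED_CODE_UNIT = re.compile(r'\\u[0-9a-fA-F]{4}')


def strip_escaped_unicode(text):
    """Remove escape sequences that are stored as plain text."""
    return ESCAPED_CODE_UNIT.sub('', text)


# Raw string: simulates text read from a file with the escapes spelled out.
sample = r'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d'
print(repr(strip_escaped_unicode(sample)))
```

Inside `decontracted`, a step like this would need to run before `remove_special_characters`, since that step may strip the backslash and leave fragments such as `ud83d` behind.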
Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/488326.html
