我有兩個 CSV,一個是主資料,另一個是組件資料,主資料有兩行和兩列,其中組件資料有 5 行和兩列。


我試圖在標記化、詞干化和詞形還原之后找到它們之間的余弦相似度,然后將相似度索引附加到新列,我無法將相應的值附加到資料幀中的列中進一步需要轉換為 CSV。
我的方法:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,WordNetLemmatizer
from collections import Counter
import pandas as pd
portStemmer=PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []
def fetchLemmantizedWords():
eliminatePunctuation = re.sub('[^a-zA-Z]', ' ',value)
convertLowerCase = eliminatePunctuation.lower()
tokenizeData = convertLowerCase.split()
eliminateStopWords = [word for word in tokenizeData if not word in set(stopwords.words('english'))]
stemWords= list(set([portStemmer.stem(value) for value in eliminateStopWords]))
wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
return wordLemmatization
def fetchCosine(eachMasterData,eachComponentData):
masterDataValues = Counter(eachMasterData)
componentDataValues = Counter(eachComponentData)
bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
masterDataVector = [masterDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
componentDataVector = [componentDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
masterDataLength = sum(contractElement*contractElement for contractElement in masterDataVector) ** 0.5
componentDataLength = sum(questionElement*questionElement for questionElement in componentDataVector) ** 0.5
dotProduct = sum(contractElement*questionElement for contractElement,questionElement in zip(masterDataVector, componentDataVector))
cosine = int((dotProduct / (masterDataLength * componentDataLength))*100)
return cosine
masterData = pd.read_csv('C:\\Similarity\\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv('C:\\Similarity\\ComponentData.csv', skipinitialspace=True)
for value in masterData['Sentences']:
eachMasterData = fetchLemmantizedWords()
for value in componentData['Sentences']:
eachComponentData = fetchLemmantizedWords()
cosineSimilarity = fetchCosine(eachMasterData,eachComponentData)
cosineSimilarityList.append(cosineSimilarity)
for value in cosineSimilarityList:
componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
#componentData['Cosine Similarity'] = value
將 df 轉換為 CSV 后的預期輸出,

在將值附加到資料框時面臨問題,請幫助我解決此問題。謝謝。
uj5u.com熱心網友回復:
這是我想出的:
樣品設定
csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""
csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""
import pandas as pd
from io import StringIO
df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')
我們最終得到 2 個資料幀(顯示df_cd):
| SI.No | 句子 | |
|---|---|---|
| 0 | 1 | 艾瑪正在寫一封信。 |
| 1 | 2 | 我們一大早就起床。 |
| 2 | 3 | 艾瑪寫信了嗎? |
| 3 | 4 | 我們晚上睡得很早。 |
| 4 | 5 | 艾瑪寫了一封信。 |
我用以下虛擬函式替換了您使用的 2 個函式:
import random
def fetchLemmantizedWords(words):
return [random.randint(1,30) for x in words]
def fetchCosine(lem_md, lem_cd):
return 100 if len(lem_md) == len(lem_cd) else random.randint(0,100)
處理資料
首先,我們fetchLemmantizedWords在每個資料幀上應用該函式。句子的正則運算式替換、小寫和拆分由 Pandas 完成,而不是在函式本身中完成。
通過首先使句子小寫,我們可以將正則運算式簡化為只考慮小寫字母。
for df in (df_md, df_cd):
df['lem'] = df.apply(lambda x: fetchLemmantizedWords(x.Sentences
.lower()
.replace(r'[^a-z]', ' ')
.split()),
result_type='reduce',
axis=1)
結果df_cd:
| SI.No | 句子 | 萊姆 | |
|---|---|---|---|
| 0 | 1 | 艾瑪正在寫一封信。 | [29、5、4、9、28] |
| 1 | 2 | 我們一大早就起床。 | [16、8、21、14、13、4、6] |
| 2 | 3 | 艾瑪寫信了嗎? | [30、9、23、16、5] |
| 3 | 4 | 我們晚上睡得很早。 | [8、25、24、7、3] |
| 4 | 5 | 艾瑪寫了一封信。 | [30、30、15、7] |
接下來,我們使用交叉連接來創建一個包含所有可能的資料組合的資料md框cd。
df_merged = pd.merge(df_md[['SI.No', 'lem']],
df_cd[['SI.No', 'lem']],
how='cross',
suffixes=('_md','_cd')
)
df_merged內容:
| SI.No_md | lem_md | SI.No_cd | lem_cd | |
|---|---|---|---|---|
| 0 | 1 | [14, 22, 9, 21, 4] | 1 | [3, 4, 8, 17, 2] |
| 1 | 1 | [14, 22, 9, 21, 4] | 2 | [29, 3, 10, 2, 19, 18, 21] |
| 2 | 1 | [14, 22, 9, 21, 4] | 3 | [20, 22, 29, 4, 3] |
| 3 | 1 | [14, 22, 9, 21, 4] | 4 | [17, 7, 1, 27, 19] |
| 4 | 1 | [14, 22, 9, 21, 4] | 5 | [17, 5, 3, 29] |
| 5 | 2 | [12, 30, 10, 11, 7, 11, 8] | 1 | [3, 4, 8, 17, 2] |
| 6 | 2 | [12, 30, 10, 11, 7, 11, 8] | 2 | [29, 3, 10, 2, 19, 18, 21] |
| 7 | 2 | [12, 30, 10, 11, 7, 11, 8] | 3 | [20, 22, 29, 4, 3] |
| 8 | 2 | [12, 30, 10, 11, 7, 11, 8] | 4 | [17, 7, 1, 27, 19] |
| 9 | 2 | [12, 30, 10, 11, 7, 11, 8] | 5 | [17, 5, 3, 29] |
Next, we calculate the cosine value:
df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md,
x.lem_cd),
axis=1)
In the last step, we pivot the data and merge the original df_cd with the calculated results :
pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
df_merged.pivot_table(index='SI.No_cd',
columns='SI.No_md').droplevel(0, axis=1),
how='inner',
left_index=True,
right_index=True)
Result (again, these are dummy calculations):
| SI.No | Sentences | 1 | 2 |
|---|---|---|---|
| 1 | Emma is writing a letter. | 100 | 64 |
| 2 | We wake up early in the morning. | 63 | 100 |
| 3 | Did Emma Write a letter? | 100 | 5 |
| 4 | We sleep early at night. | 100 | 17 |
| 5 | Emma wrote a letter. | 35 | 9 |
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/445730.html
標籤:Python python-3.x 列表 数据框 CSV
