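I am trying to build an email scraper that fetches certain emails and looks for values in them to store in a CSV file. I have tried many things to solve this, but so far without success.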
import imaplib

# user, password and imap_url are defined elsewhere (not shown)

# Function to get the email content part, i.e. its body
def get_body(msg):
    if msg.is_multipart():
        # a multipart message has no payload of its own; recurse into its first part
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(decode=True).decode()

# Function to search for a key/value pair
def search(key, value, con):
    result, data = con.search(None, key, '"{}"'.format(value))
    return data

# Function to get the list of emails under this label
def get_emails(result_bytes):
    print("get email")
    msgs = []  # all the email data is pushed into this list
    for num in result_bytes[0].split():
        typ, data = con.fetch(num, '(RFC822)')
        msgs.append(data)
    return msgs

# this is done to make an SSL connection with GMAIL
con = imaplib.IMAP4_SSL(imap_url)
con.login(user, password)
con.select('Inbox')

msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))

for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            print(msg)
            # encoding set as utf-8
            content = sent[1], 'utf-8'
            data = str(content)

            # Handling errors related to UnicodeEncodeError
            try:
                indexstart = data.find("span")
                data2 = data[indexstart + 5: len(data)]
                indexend = data2.find("</div>")

                # printing the required content which we need
                # to extract from our email, i.e. our body
                waarde = data2[0: indexend]
                test_naam_1 = waarde.split("Naam: ", 1)[1]
                echte_naam = test_naam_1.split("Email: ", -1)[0]
                email_test = waarde.split("Email: ", 1)[1]
                echte_email = email_test.split("Tel nr.: ", -1)[0]
                tel_test = waarde.split("Tel nr.: ", 1)[1]
                echte_tel = tel_test.split("Onderwerp: ", -1)[0]
                subj_test = waarde.split("Onderwerp: ", 1)[1]
                echte_subj = subj_test.split("Bericht: ", -1)[0]
                print("---ADRESGEGEVENS---")
                print("---Naam: " + echte_naam + "---")
                print("---Email: " + echte_email + "---")
                print("---Tel nr.: " + echte_tel + "---")
                print("---Onderwerp: " + echte_subj + "---")
            except UnicodeEncodeError:
                pass
In my results I still get these ugly line breaks, which look like this in my markup:
[(b'12638 (RFC822 {1973}', b'MIME-Version: 1.0\r\nDate: Mon, 25 Oct 2021 16:41:46 +0200\r\nMessage-ID: <CAJDn=xsVynQqp7BwYoGZB=v21-AAR5=xcMkQ8D2kXE7ZpYFNNQ@mail.example.com>\r\nSubject: TESTTITELPYTHON\r\nFrom: Patrick Merkx <[email protected]>\r\nTo: Patrick Merkx <[email protected]>\r\nContent-Type: multipart/alternative; boundary="00000000000042e6ae05cf2e5c7e"\r\n\r\n--00000000000042e6ae05cf2e5c7e\r\nContent-Type: text/plain; charset="UTF-8"\r\n\r\nContactformulier ingevuld door:\r\nNaam: Patrick Merkx\r\nEmail: [email protected]\r\nTel nr.: 0611381219\r\n\r\nOnderwerp: Nog een test\r\n\r\nBericht:\r\nBericht\r\n\r\n--00000000000042e6ae05cf2e5c7e\r\nContent-Type: text/html; charset="UTF-8"\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n<div dir=3D"ltr"><div><div dir=3D"ltr" class=3D"gmail_signature" data-smart=\r\nmail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><div><div d=\r\nir=3D"ltr"><div style=3D"font-stretch:normal;font-size:13.33px;line-height:=\r\n19.99px;background:none;border:0px rgb(34,34,34);width:600px;overflow:visib=\r\nle;min-height:0px;outline-width:0px"><span class=3D"gmail-il" style=3D"font=\r\n-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=\r\nngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=\r\nsmall">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=\r\n=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=\r\nil.com" target=3D"_blank" style=3D"font-size:small">[email protected]=\r\n</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =\r\n0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=\r\n"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=\r\n=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=\r\n:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=\r\nze:small">Bericht</span><br></div></div></div></div></div></div></div></div=\r\n></div>\r\n\r\n--00000000000042e6ae05cf2e5c7e--'), b')']
class=3D"gmail-il" style=3D"font=\r\n-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=\r\nngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=\r\nsmall">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=\r\n=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=\r\nil.com" target=3D"_blank" style=3D"font-size:small">[email protected]=\r\n</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =\r\n0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=\r\n"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=\r\n=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=\r\n:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=\r\nze:small">Bericht</span><br>
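I have also tried stripping the body tags, decoding, and a variety of other solutions, but so far no luck. I can't seem to remove these line breaks in any way I know of.
What exactly am I doing wrong?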
Answer:
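You are looking at a message part with Content-Transfer-Encoding: quoted-printable. The proper way to decode it is to traverse the MIME structure and interpret the individual parts as you go. But there is no need to do that explicitly; Python's email library already does this for you.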
from email import message_from_bytes
from email.policy import default
...
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            msg = message_from_bytes(sent[1], policy=default)
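Unfortunately, without an example of the MIME structure in these messages, I can't tell you exactly what to do with the resulting message. You probably have something like a "main" MIME body part; msg.get_body(preferencelist=('html', 'plain')) will pull it out, and calling get_content() on the result will extract the actual body text.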
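As a rough sketch of how that could look, the snippet below parses a small multipart/alternative message shaped like the dump in the question and lets the email library undo the quoted-printable encoding. The sample bytes, the name and address in them, and the extract_fields helper are all made up for illustration; it also assumes the plain-text alternative carries the same fields as the HTML one, so no markup needs to be stripped.

```python
import csv
import io
import re
from email import message_from_bytes
from email.policy import default

# Hypothetical sample shaped like the dump in the question; not a real message.
RAW = (
    b'MIME-Version: 1.0\r\n'
    b'Subject: TESTTITELPYTHON\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'\r\n'
    b'Contactformulier ingevuld door:\r\n'
    b'Naam: Jan Jansen\r\n'
    b'Email: jan@example.com\r\n'
    b'Tel nr.: 0612345678\r\n'
    b'\r\n'
    b'Onderwerp: Nog een test\r\n'
    b'\r\n'
    b'Bericht:\r\n'
    b'Bericht\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<div dir=3D"ltr"><span style=3D"font-size:small">Naam: Jan Jan=\r\n'
    b'sen</span></div>\r\n'
    b'\r\n'
    b'--BOUNDARY--\r\n'
)

msg = message_from_bytes(RAW, policy=default)

# get_content() applies the transfer encoding and charset of each part.
plain = msg.get_body(preferencelist=('plain', 'html')).get_content()
html = msg.get_body(preferencelist=('html', 'plain')).get_content()


def extract_fields(text):
    """Collect 'Label: value' lines from the form text (hypothetical helper)."""
    fields = {}
    for label in ('Naam', 'Email', 'Tel nr.', 'Onderwerp'):
        pattern = r'^' + re.escape(label) + r':[ \t]*(.*)$'
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            # strip() also drops the trailing '\r' left by CRLF line endings
            fields[label] = match.group(1).strip()
    return fields


fields = extract_fields(plain)

# Write one CSV row to an in-memory buffer; a real script would open a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(fields))
writer.writeheader()
writer.writerow(fields)

print(html)               # the "=3D" and "=\r\n" sequences are gone
print(buffer.getvalue())
```

Reading the fields from the text/plain part avoids slicing HTML by hand; if a message only has an HTML part, the same get_content() call still returns it with the quoted-printable encoding removed.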
The policy=default keyword argument selects the email.message.EmailMessage class, which was introduced in Python 3.6, instead of the legacy email.message.Message class from older versions.
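To see the difference the keyword makes, here is a small sketch that parses the same bytes twice; the sample message is invented for the demonstration.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default

# Made-up single-part message, just enough to parse.
raw = b'Subject: TESTTITELPYTHON\r\n\r\nNaam: Jan Jansen\r\n'

legacy = message_from_bytes(raw)                  # default compat32 policy
modern = message_from_bytes(raw, policy=default)  # modern policy

print(type(legacy).__name__)
print(type(modern).__name__)

# get_body() and get_content() only exist on the modern class.
print(hasattr(legacy, 'get_body'), hasattr(modern, 'get_body'))
```

Without policy=default, calling msg.get_body(...) on the parsed message raises an AttributeError, which is an easy mistake to make when adapting older examples.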
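In more detail, trying to decode the raw email body as UTF-8 is simply wrong. A typical MIME message has several parts, each of which may use a different encoding, and many of them definitely do not use UTF-8 (though it is becoming increasingly common; typically, however, the actual UTF-8 sits beneath a content transfer encoding, which protects it from corruption in transit over a route that may not be 8-bit clean).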
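The following sketch illustrates the point with an invented two-part message: one part is ISO-8859-1 wrapped in quoted-printable, the other is UTF-8 wrapped in base64. Decoding the raw bytes as UTF-8 leaves both wrappers in place, while walking the parts and calling get_content() handles each one according to its own headers.

```python
import base64
from email import message_from_bytes
from email.policy import default

# 'Ren\u00e9' is "René"; the escape keeps this source file pure ASCII.
html_b64 = base64.b64encode('<p>Naam: Ren\u00e9</p>'.encode('utf-8'))

# Invented sample: each part declares its own charset and transfer encoding.
RAW = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b'\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'Naam: Ren=E9\r\n'
    b'--BOUNDARY\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'\r\n' + html_b64 + b'\r\n'
    b'--BOUNDARY--\r\n'
)

# Naive approach: the bytes decode without error, but nothing is unwrapped.
naive = RAW.decode('utf-8')

msg = message_from_bytes(RAW, policy=default)
decoded = {}
for part in msg.walk():
    if part.get_content_maintype() != 'text':
        continue  # skip the multipart container itself
    print(part.get_content_type(),
          part.get_content_charset(),
          part['Content-Transfer-Encoding'])
    decoded[part.get_content_type()] = part.get_content()

print(decoded)
```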
