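I have OCR text of a periodical bibliography made up of structured entries. I want to use the Invisible XML (iXML) standard to extract and parse the entries.
Sample input: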
1 2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge,
NJ. Published by Word Up! Video, Inc. Last issue 66 pages.
Height 28 cm. Line drawings; Photographs (some in color);
Commercial advertising; Table of contents. Previous editor(s):
Marica A. Cole. ISSN 1056-4632. LC card no. sn91-1965.
OCLC no. 23715422. Subject focus and/or Features: Hip hop
culture, Music, Rap music.
WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993
6561 The Zora Neale Hurston Forum. 1986-. Frequency:
Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston
Forum, P.O. Box 550, Morgan State University, Baltimore,
MD 21239. $15 for individuals and institutions. Telephone:
(301) 444-3435. Published by Zora Neale Hurston Society.
Last issue 69 pages. Last volume 142 pages. Height 23 cm.
Photographs; Table of contents. ISSN 1051-6867. LC card no.
90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale, Literature, Literary criticism.
MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994
TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987
WU v.l, n.l- AP/Z893/N345 Fall, 1986
6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.
Nabile P. Hage, Editor, Zwanna, P.O. Box 38261, Atlanta, GA
30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32
pages. Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961. Subject focus and/or
Features: Comic books, strips, etc.
WHi v.l, n.l Pam 00-305 Apr/May, 1993
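Each entry begins with an entry number, followed by one or more space characters, then descriptive text broken across lines by newline characters.
iXML grammar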
data: entry+ .
entry: -#a, entrynum, " "+, content .
entrynum: -digit+ .
digit: ["1"-"9"] .
content: ~[]+ ; -#a .
This first attempt at an iXML grammar produced an ambiguous parse (using the CoffeePot iXML processor).
Output
<data xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<entry>
<entrynum>1</entrynum>
<content>2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge, NJ. Published by Word Up! Video,
Inc. Last issue 66 pages. Height 28 cm. Line drawings; Photographs (some in color); Commercial
advertising; Table of contents. Previous editor(s): Marica A. Cole. ISSN 1056-4632. LC card
no. sn91-1965. OCLC no. 23715422. Subject focus and/or Features: Hip hop culture, Music, Rap
music. WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993 6561 The Zora Neale Hurston
Forum. 1986-. Frequency: Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston Forum,
P.O. Box 550, Morgan State University, Baltimore, MD 21239. $15 for individuals and
institutions. Telephone: (301) 444-3435. Published by Zora Neale Hurston Society. Last issue
69 pages. Last volume 142 pages. Height 23 cm. Photographs; Table of contents. ISSN 1051-6867.
LC card no. 90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale,
Literature, Literary criticism. MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994 TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987 WU v.l, n.l-
AP/Z893/N345 Fall, 1986</content>
</entry>
<entry>
<entrynum>6562</entrynum>
<content>Zwanna: Son of Zulu. 1993-. Frequency: Unknown. Nabile P. Hage, Editor, Zwanna, P.O.
Box 38261, Atlanta, GA 30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32 pages.
Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961.
Subject focus and/or Features: Comic books, strips, etc. WHi v.l, n.l Pam 00-305 Apr/May, 1993
</content>
</entry>
</data>
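To begin with, I would like to understand how to chunk the entries, and then start parsing the content: for example, each entry number is followed by one or more spaces, then an alphanumeric title followed by a period, and so on.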
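Reply from a uj5u.com user:
"Maybe." One of iXML's great strengths is that it can cope with ambiguity, which makes grammars easier to write. If the ambiguous alternatives are all equally valid, or you don't care which one is chosen, that works very well.
For bibliographic data, I suspect some alternatives are more valid than others and you do care which one is chosen, which makes things harder. I'd also bet that the imperfect OCR introduces a lot of ambiguity.
I don't think a single iXML grammar will parse the input and produce exactly the output you want, but it could form a useful part of some broader strategy. I would first try to split the bibliography into individual entries, so that the grammar only has to cover a single entry. Then I would see whether I could work out distinct categories of entry (books, magazines, journals, and so on), possibly with a different grammar for each.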
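The splitting step suggested above could be done before any iXML processing. What follows is only a minimal Python sketch under a stated assumption: an entry starts on a line consisting of digits, one or more spaces, and then text. The names `ENTRY_START` and `split_entries` are made up for this illustration and do not come from any iXML tool.

```python
import re

# Assumption: an entry starts on a line made of digits, one or more spaces,
# and then some text. Real OCR output may well break this rule.
ENTRY_START = re.compile(r"^(\d+) +(\S.*)$")


def split_entries(text):
    """Group OCR lines into (entry_number, body) pairs."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_START.match(line)
        if match:
            # A new entry: keep its number and the rest of the line.
            entries.append((match.group(1), [match.group(2).strip()]))
        elif entries and line.strip():
            # A continuation line belongs to the most recent entry.
            # Lines seen before the first entry are silently dropped.
            entries[-1][1].append(line.strip())
    return [(number, " ".join(body)) for number, body in entries]


sample = "\n".join([
    "6561 The Zora Neale Hurston Forum. 1986-. Frequency:",
    "MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,",
    "1994",
    "6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.",
    "30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32",
])

for number, body in split_entries(sample):
    print(number, "|", body)
```

Each body could then be handed to a grammar written for a single entry. The heuristic is deliberately crude: a bare "1994" or a line starting "30334." is not taken as an entry start, but a continuation line that happened to begin with something like "69 pages" would be, so the result still needs checking.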
Good luck!
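Reply from a uj5u.com user:
Your grammar is very, very ambiguous: because "~[]" includes #a, there are dozens of ways to parse the input. You have to decide how to identify the start of an entry unambiguously, and if the rule is "it starts with a digit", then you also have to prevent lines that start with a digit from being recognised as "content", for example: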
content: line+.
line: ~["0"-"9"], ~[#a]*, #a.
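If you want to track down ambiguities, you can try my implementation (https://homepages.cwi.nl/~steven/ixml/tutorial/run.html); it is much slower than Norm's, but it reports potentially useful information about the source of an ambiguity.
Here is a reasonable first attempt for your content, but note that the lone 1994 in the content is treated as an entry number: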
ocr: entry+.
entry: numbered, unnumbered*.
-numbered: number, (line*; -#a), blank-line.
-blank-line: -#a.
-line: ~[#a]+, -#a.
@number: ["0"-"9"]+, -" ".
-unnumbered: ~["0"-"9"; #a], line+, blank-line.
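One possible way to deal with the stray 1994 outside the grammar is a sequence check on candidate entry numbers. The Python sketch below is only an illustration and assumes that entry numbers never decrease through the bibliography; that holds for the sample shown but is not guaranteed for the whole book. The names `CANDIDATE` and `entry_start_indexes` are hypothetical.

```python
import re

# Assumption: a candidate entry number is a run of digits at the start of a
# line, followed by whitespace or by the end of the line.
CANDIDATE = re.compile(r"(\d+)(?:\s|$)")


def entry_start_indexes(lines):
    """Return indexes of lines whose leading number exceeds the last accepted one."""
    starts = []
    previous = None
    for index, line in enumerate(lines):
        match = CANDIDATE.match(line)
        if match is None:
            continue
        number = int(match.group(1))
        # Assumption: entry numbers only ever increase, so a smaller number
        # (such as a year wrapped onto its own line) is treated as content.
        if previous is None or number > previous:
            starts.append(index)
            previous = number
    return starts


lines = [
    "1 2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge,",
    "6561 The Zora Neale Hurston Forum. 1986-. Frequency:",
    "MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,",
    "1994",
    "6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.",
]
print(entry_start_indexes(lines))
```

The check is easy to fool: a single spurious large number at the start of a line would cause every later genuine entry to be rejected, so it is best treated as a filter whose output is reviewed rather than trusted.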
