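This article shows how to extract tables from a PDF in a C# program (VB.NET code is included as well). It uses the table extraction classes and methods provided by Spire.PDF for .NET to read the text content of each table cell. The main classes and methods used in the code are summarized in the table below for reference: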

| Type | Description |
| --- | --- |
| PdfDocument Class | Represents a PDF document model. |
| PdfDocument.LoadFromFile(string filename) Method | Loads a PDF document. |
| PdfTableExtractor Class | Represents the PDF table extractor. |
| PdfTable Class | Defines a PDF table. |
| PdfTableExtractor.ExtractTable(int pageIndex) Method | Extracts the tables from a page. |
| PdfTable.GetText(int rowIndex, int columnIndex) Method | Gets the text in a cell. |
| File.WriteAllText() Method | Saves the text extracted from the table to a .txt file. |

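Environment setup
- Visual Studio 2017
- .NET Framework 4.6.1
- A PDF test file
- Library: Spire.PDF for .NET 7.10.4
There are two ways to reference the DLL files:
Method 1: Install through NuGet.
[Steps]
Right-click "References" and choose "Manage NuGet Packages".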

Click "Browse", enter Spire.PDF in the search box, and click "Install".

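Alternatively, install it from the Package Manager Console:
PM> Install-Package Spire.PDF -Version 7.10.4
Method 2: Add the reference manually.
[Steps]
Right-click "References" and choose "Add Reference".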

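Click the "Browse" tab, then the "Browse" button, and add the DLL files from the local path to the reference list (the package must be downloaded and unzipped beforehand).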


Click OK to finish adding the reference:

Code examples
C#
using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;

namespace ExtractTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the PDF document
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("sample.pdf");

            StringBuilder builder = new StringBuilder();

            // Extract the tables page by page
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);
            PdfTable[] tableLists = null;
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                tableLists = extractor.ExtractTable(pageIndex);
                if (tableLists != null && tableLists.Length > 0)
                {
                    foreach (PdfTable table in tableLists)
                    {
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();
                        for (int i = 0; i < row; i++)
                        {
                            for (int j = 0; j < column; j++)
                            {
                                // Append the cell text, separated by a space
                                string text = table.GetText(i, j);
                                builder.Append(text + " ");
                            }
                            // End the current table row
                            builder.Append("\r\n");
                        }
                    }
                }
            }

            // Save the extracted table content to a .txt file
            File.WriteAllText("ExtractedTable.txt", builder.ToString());
        }
    }
}
VB.NET
Imports Spire.Pdf
Imports Spire.Pdf.Utilities
Imports System.IO
Imports System.Text

Namespace ExtractTable
    Class Program
        Private Shared Sub Main(args As String())
            ' Load the PDF document
            Dim pdf As New PdfDocument()
            pdf.LoadFromFile("sample.pdf")

            Dim builder As New StringBuilder()

            ' Extract the tables page by page
            Dim extractor As New PdfTableExtractor(pdf)
            Dim tableLists As PdfTable() = Nothing
            For pageIndex As Integer = 0 To pdf.Pages.Count - 1
                tableLists = extractor.ExtractTable(pageIndex)
                If tableLists IsNot Nothing AndAlso tableLists.Length > 0 Then
                    For Each table As PdfTable In tableLists
                        Dim row As Integer = table.GetRowCount()
                        Dim column As Integer = table.GetColumnCount()
                        For i As Integer = 0 To row - 1
                            For j As Integer = 0 To column - 1
                                ' Append the cell text, separated by a space
                                Dim text As String = table.GetText(i, j)
                                builder.Append(text & " ")
                            Next
                            ' End the current table row
                            builder.Append(vbCrLf)
                        Next
                    Next
                End If
            Next

            ' Save the extracted table content to a .txt file
            File.WriteAllText("ExtractedTable.txt", builder.ToString())
        End Sub
    End Class
End Namespace
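The examples above join cells with a single space, so a cell that itself contains spaces can no longer be told apart from its neighbours in the output file. If column boundaries matter, one option is to write each row as a CSV line instead. The following is a minimal sketch using only the .NET base class library; `CsvFormatter`, `EscapeCell`, and `ToCsvLine` are hypothetical helper names introduced here for illustration and are not part of Spire.PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace ExtractTable
{
    // Hypothetical helper (not part of Spire.PDF): formats cell texts as CSV.
    public static class CsvFormatter
    {
        // Wrap a cell in double quotes when it contains a comma, a double
        // quote, or a line break; embedded double quotes are doubled.
        public static string EscapeCell(string cell)
        {
            string value = cell ?? string.Empty;
            bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
            return needsQuotes
                ? "\"" + value.Replace("\"", "\"\"") + "\""
                : value;
        }

        // Join the cells of one table row into a single CSV line.
        public static string ToCsvLine(IEnumerable<string> cells)
        {
            return string.Join(",", cells.Select(EscapeCell));
        }
    }
}
```

To use it in the C# example, the inner loop could collect the results of `table.GetText(i, j)` into a `List<string>` and, after the loop over `j`, call `builder.AppendLine(CsvFormatter.ToCsvLine(cells))` in place of appending `"\r\n"`.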
Result of the table extraction:

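Other notes:
- In the code, the PDF file and the generated .txt file are located at F:\VS2017Project\ExtractTable\bin\Debug\sample.pdf and F:\VS2017Project\ExtractTable\bin\Debug\ExtractedTable.txt. Both paths can be changed to any other location.
- Check the version of the DLL files you are using: versions lower than 7.10.4 do not support table extraction.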
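A relative name such as "sample.pdf" is resolved against the process's current working directory, which is typically bin\Debug when the project is started from Visual Studio but may differ when the program is launched another way. A small sketch of making the location explicit, using only `System.IO`; `SamplePaths` and `NextToExecutable` are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;

namespace ExtractTable
{
    // Hypothetical helper: builds paths that do not depend on the
    // current working directory.
    public static class SamplePaths
    {
        // Return the absolute path of a file located in the application's
        // base directory (the folder containing the executable).
        public static string NextToExecutable(string fileName)
        {
            return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, fileName);
        }
    }
}
```

With this helper, the load call in the C# example could be written as `pdf.LoadFromFile(SamplePaths.NextToExecutable("sample.pdf"))`, and the output path passed to `File.WriteAllText` could be built the same way.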
