CSV 檔案
我有一個大約 1.3 GB 的 CSV 檔案:
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 10/4/2021 1:23 PM 1397998768 XBTUSD.csv
這是 Kraken 交易所位元幣交易資料的完整串列。
CSV 中的資料如下所示:
> Get-Content .\XBTUSD.csv | Select-Object -First 10
1381095255,122.00000,0.10000000
1381179030,123.61000,0.10000000
1381201115,123.91000,1.00000000
1381201115,123.90000,0.99160000
1381210004,124.19000,1.00000000
1381210004,124.18000,1.00000000
1381311039,124.01687,1.00000000
1381311093,124.01687,1.00000000
1381311094,123.84000,0.82300000
1381431835,125.85000,1.00000000
有關該檔案的更多資訊,請訪問:
https://support.kraken.com/hc/en-us/articles/360047543791-Downloadable-historical-market-data-time-and-sales
該檔案可以從這里下載:
https://drive.google.com/drive/folders/1jI3mZvrPbInNAEaIOoMbWvFfgRDZ44TT
請參閱檔案XBT.zip。在那個檔案里面是XBTUSD.csv.
基線測驗 - 直接匯入到 sqlite
如果我在 sqlite 中創建下表:
CREATE TABLE CsvTrades (
"TimeStamp" TEXT NOT NULL,
"Price" TEXT NOT NULL,
"Volume" TEXT NOT NULL
);
并運行以下命令以匯入 CSV(以及所需的時間):
$a = Get-Date
sqlite3.exe .\kraken-trades.db -cmd '.mode csv' '.import C:/Users/dharm/XBTUSD.csv CsvTrades'
$b = Get-Date
($b - $a).TotalMinutes
我得到以下資訊:
1.56595191666667
1.5 分鐘。不錯!
使用 EF 核心
在下面的代碼中,我使用的是CsvHelper包:
https://joshclose.github.io/CsvHelper/getting-started/
Here's a class for the CSV file rows:
public class CsvRow
{
[CsvHelper.Configuration.Attributes.Index(0)]
public long TimeStamp { get; set; }
[CsvHelper.Configuration.Attributes.Index(1)]
public decimal Price { get; set; }
[CsvHelper.Configuration.Attributes.Index(2)]
public decimal Quantity { get; set; }
}
Here's a class for the Trade entity:
[Index(nameof(TimeStamp))]
public class Trade
{
public int Id { get; set; }
public decimal Price { get; set; }
public decimal Quantity { get; set; }
public DateTime TimeStamp { get; set; }
}
The DbContext is straightforward:
public class AppContext : DbContext
{
public DbSet<Trade> Trades { get; set; }
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
var folder = Environment.SpecialFolder.LocalApplicationData;
var path = Environment.GetFolderPath(folder);
var db_path = $"{path}{System.IO.Path.DirectorySeparatorChar}kraken-trades.db";
optionsBuilder.UseSqlite($"Data Source={db_path}");
}
}
And finally, the function that performs the import:
void initialize_from_csv()
{
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
HasHeaderRecord = false
};
using (var reader = new StreamReader(@"C:\Users\dharm\XBTUSD.csv"))
using (var csv = new CsvReader(reader, config))
{
var records = csv.GetRecords<CsvRow>().Select(row => new Trade()
{
Price = row.Price,
Quantity = row.Quantity,
TimeStamp = DateTimeOffset.FromUnixTimeSeconds(row.TimeStamp).UtcDateTime
});
using (var db = new AppContext())
{
Console.WriteLine(DateTime.Now);
while (true)
{
//var items = records.Take(10_000).ToList();
var items = records.Take(100_000).ToList();
if (items.Any() == false) break;
Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);
db.AddRange(items);
db.SaveChanges();
}
Console.WriteLine(DateTime.Now);
}
}
}
The question
When I let this run, it does indeed proceed to add the items to the database. However, it's pretty slow; I haven't timed it to completion but I could see it taking over an hour.
Is there a good way to make this faster while still using EF Core?
Notes
The code referenced above is available here in a single file:
https://github.com/dharmatech/kraken-trades-database/blob/003-minimal/KrakenTradesDatabase/Program.cs
It's a .NET 6 project. Let me know if you have any issues building and running it.
Timing
I added some code to time the batch adds. It looks like it's taking around 7 seconds per 100,000 records.
Starting batch at 2013-10-06. Batch took 00:00:08.7689932.
Starting batch at 2015-12-08. Batch took 00:00:06.7453421.
Starting batch at 2016-04-19. Batch took 00:00:06.7833506.
Starting batch at 2016-06-25. Batch took 00:00:06.7083806.
Starting batch at 2016-08-22. Batch took 00:00:06.7826717.
Starting batch at 2016-11-20. Batch took 00:00:06.4212123.
wc says there are 41,695,261 lines:
$ wc -l XBTUSD.csv
41695261 XBTUSD.csv
So at that rate, it would take around 48 minutes.
Why EF Core?
Some folks have asked, why use EF Core for this? Why not just a direct import?
The example above was intentionally simplified to focus on the speed of the import.
I have more elaborate versions where there are relationships with other entities. In that case:
Setting up the other tables and foreign key properties is more straightforward with EF Core.
I can more easily switch between database backends (SQL Server, PostgreSQL, sqlite).
See for example this branch where multiple symbols are imported. There's a relationship between the Trade and Symbol there. There could be other relationships as well.
https://github.com/dharmatech/kraken-trades-database/blob/006/KrakenTradesDatabase/Program.cs
uj5u.com熱心網友回復:
EFCore.BulkExtensions
使用以下內容:
https://github.com/borisdj/EFCore.BulkExtensions
然后改變這一行:
db.AddRange(items);
到:
db.BulkInsert(items);
使匯入時間從 48 分鐘縮短到 5.7 分鐘。
此版本的專案可在此處獲得:
https://github.com/dharmatech/kraken-trades-database/blob/004-bulk-extensions/KrakenTradesDatabase/Program.cs
謝謝
感謝 Caius Jard在上面的評論中建議使用EFCore.BulkExtensions。
uj5u.com熱心網友回復:
這是一個完整的 C# (10.0) 程式,它比 sqlite3 工具更快地插入 CSV 資料。它使用我的Sylvan.Data.Csv庫,這是最快的 .NET CSV 決議器。
在我的機器上,sqlite3 將資料1:07.6插入到1:02.9.
雖然這不符合您“仍在使用 EFCore”的要求,但我認為性能上的差異不言而喻。
套餐:
<PackageReference Include="Sylvan.Data.Csv" Version="1.1.9" />
<PackageReference Include="System.Data.SQLite" Version="1.0.115.5" />
代碼:
using System.Collections.ObjectModel;
using System.Data.Common;
using System.Data.SQLite;
using System.Diagnostics;
using Sylvan.Data.Csv;
var sw = Stopwatch.StartNew();
var conn = new SQLiteConnection("Data Source=test.db");
conn.Open();
var data = CsvDataReader.Create("xbtusd.csv", new CsvDataReaderOptions { HasHeaders = false });
// create the target table
{
using var cmd = conn.CreateCommand();
var tbl = "create table CsvTrades (TimeStamp TEXT NOT NULL, Price TEXT NOT NULL, Volume TEXT NOT NULL)";
cmd.CommandText = tbl;
cmd.ExecuteNonQuery();
}
// get the schema for the target table.
ReadOnlyCollection<DbColumn> ss;
{
using var cmd = conn.CreateCommand();
cmd.CommandText = "select * from CsvTrades limit 0;";
var r = cmd.ExecuteReader();
ss = r.GetColumnSchema();
}
// create the parameterized insert command
var cmdW = new StringWriter();
cmdW.Write("insert into CsvTrades values(");
int i = 0;
foreach (var c in ss)
{
if (i > 0)
cmdW.Write(",");
cmdW.Write("$p" i );
}
cmdW.Write(");");
var cmdt = cmdW.ToString();
// insert CSV data.
using (var tx = conn.BeginTransaction())
{
var cmd = conn.CreateCommand();
cmd.CommandText = cmdt;
for (i = 0; i < data.FieldCount; i )
{
var p = cmd.CreateParameter();
p.ParameterName = "$p" i;
cmd.Parameters.Add(p);
}
cmd.Prepare();
while (data.Read())
{
for (i = 0; i < data.FieldCount; i )
{
cmd.Parameters[i].Value = data.GetValue(i);
}
cmd.ExecuteNonQuery();
}
tx.Commit();
}
sw.Stop();
Console.WriteLine($"Inserted {data.RowNumber} records in {sw.Elapsed}");
更新:我意識到我的代碼沒有做它需要做的一切:它沒有將 unix 秒轉換為日期時間。修改插入回圈如下:
while (data.Read())
{
cmd.Parameters[0].Value = DateTime.UnixEpoch.AddSeconds(data.GetInt64(0));
cmd.Parameters[1].Value = data.GetString(1);
cmd.Parameters[2].Value = data.GetString(2);
cmd.ExecuteNonQuery();
}
這將速度減慢到 1:17.5,這比 sqlite3 稍慢,但隨后 sqlite3 插入不進行資料轉換,因此您最終會得到原始整數(長)值。
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/378631.html
標籤:c# entity-framework entity-framework-core
下一篇:WPF組合框-默認選擇第一項
