Improving GROUP BY queries on a large time series dataset

I have a large table containing a full year of time series data, with 24 values per day (one per hour), for multiple customers.

customer_id	date	value
001	2020-01-01 00:00:00	xx
001	2020-01-01 00:01:00	xx
001	2020-01-01 00:02:00	xx
...	...	...
002	2020-01-01 00:00:00	xx
002	2020-01-01 00:01:00	xx
002	2020-01-01 00:02:00	xx
...	...	...

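I currently store the whole dataset in a single SQLite table, which I query from a Python application where you can choose between different visualizations (by month, showing 12 values per year, or by day, showing 365 values per year).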

CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id"   INTEGER NOT NULL,
    "date"  DATETIME NOT NULL,
    "value" FLOAT NOT NULL,
    "year"  INTEGER NOT NULL,
    "month" INTEGER NOT NULL,
    "day"   INTEGER NOT NULL,
    PRIMARY KEY("index")
);

CREATE INDEX "idx_data_ym" ON "data" (
    "year",
    "month"
);

CREATE INDEX "idx_data_ymd" ON "data" (
    "year",
    "month",
    "day"
);

A naive query I could use to display the monthly data is

SELECT date, sum(value) FROM data GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER), CAST(STRFTIME('%m', date) AS INTEGER)

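This is quite slow on large datasets and, as far as I understand, cannot use an index. That is why I store year, month and day as additional columns, so that I can use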

SELECT date, sum(value) from data GROUP BY year, month

Result:

date	sum
2017-01-01 00:00:00	yy
2017-01-02 00:00:00	yy
..	..
2018-01-01 00:00:00	yy
2018-01-02 00:00:00	yy
..	..

For a small test dataset (100 customers, 24 values per day, two years: 100 * 24 * 365 * 2 = 1,752,000 records), the first query takes about 20 s, while the second one takes only 1.8 s.
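A comparison like this can be reproduced with a short script. The sketch below is only an illustration under stated assumptions: the helper names `hourly_rows` and `timed` are made up here, the dataset is scaled down to 5 customers and 60 days, and the values are synthetic. The absolute timings will differ from the figures quoted above.

```python
import sqlite3
import time
from datetime import datetime, timedelta

def hourly_rows(n_customers, start, n_days):
    """Yield (customer_id, date, value, year, month, day), one row per hour."""
    for customer_id in range(1, n_customers + 1):
        for h in range(24 * n_days):
            ts = start + timedelta(hours=h)
            yield (customer_id, ts.strftime("%Y-%m-%d %H:%M:%S"),
                   float(ts.hour), ts.year, ts.month, ts.day)

def timed(conn, sql):
    """Run a query and return (rows, elapsed seconds)."""
    t0 = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return result, time.perf_counter() - t0

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (
    "index" INTEGER NOT NULL, customer_id INTEGER NOT NULL,
    date DATETIME NOT NULL, value FLOAT NOT NULL,
    year INTEGER NOT NULL, month INTEGER NOT NULL, day INTEGER NOT NULL,
    PRIMARY KEY("index")
);
CREATE INDEX idx_data_ym ON data (year, month);
CREATE INDEX idx_data_ymd ON data (year, month, day);
""")

# Scaled-down synthetic data: 5 customers, 60 days starting 2020-01-01.
rows = [(i, *r) for i, r in
        enumerate(hourly_rows(5, datetime(2020, 1, 1), 60), start=1)]
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

by_expr, t_expr = timed(conn, """
    SELECT date, sum(value) FROM data
    GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER),
             CAST(STRFTIME('%m', date) AS INTEGER)""")
by_cols, t_cols = timed(conn,
    "SELECT date, sum(value) FROM data GROUP BY year, month")

print(f"GROUP BY strftime(...): {t_expr:.4f} s")
print(f"GROUP BY year, month:   {t_cols:.4f} s")
```

Both queries should return the same monthly sums; only the way SQLite builds the groups differs.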

Checking the queries with EXPLAIN QUERY PLAN, the second query is using the idx_data_ym index, which is what I wanted, while the first query does not use an index.

Output of EXPLAIN QUERY PLAN for the first query:

id	parent	notused	detail
6	0	0	SCAN TABLE data
8	0	0	USE TEMP B-TREE FOR GROUP BY

Output for the second query:

id	parent	notused	detail
7	0	0	SCAN TABLE data USING INDEX idx_data_ym
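The same check can be run from the Python side. This is a minimal sketch that assumes the schema shown earlier; `query_plan` is a hypothetical helper name, and the exact wording of the plan text varies between SQLite versions (newer releases print `SCAN data` rather than `SCAN TABLE data`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (
    "index" INTEGER NOT NULL, customer_id INTEGER NOT NULL,
    date DATETIME NOT NULL, value FLOAT NOT NULL,
    year INTEGER NOT NULL, month INTEGER NOT NULL, day INTEGER NOT NULL,
    PRIMARY KEY("index")
);
CREATE INDEX idx_data_ym ON data (year, month);
CREATE INDEX idx_data_ymd ON data (year, month, day);
""")

def query_plan(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a list of strings."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

slow_plan = query_plan(conn, """
    SELECT date, sum(value) FROM data
    GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER),
             CAST(STRFTIME('%m', date) AS INTEGER)""")
fast_plan = query_plan(conn,
    "SELECT date, sum(value) FROM data GROUP BY year, month")

for label, plan in (("first query", slow_plan), ("second query", fast_plan)):
    print(label)
    for detail in plan:
        print("   ", detail)
```

A `USE TEMP B-TREE FOR GROUP BY` step means SQLite has to sort the rows itself before grouping; when the plan instead scans an index whose leading columns match the GROUP BY, the rows already arrive in group order.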

Now I'm wondering: 1.8 s may be okay, but the dataset in production will be much larger, which will slow down the application tremendously. Grouping by year, month, day to get the sums for each day is even slower.

Is there a way to improve my query performance? Am I doing something completely wrong in how I aggregate the data?

Thanks for your help!

Answer:

There is no need for the extra year, month and day columns.

You can create indexes on expressions:

CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id" INTEGER NOT NULL,
    "date" TEXT NOT NULL, -- there is no DATETIME data type in SQLite
    "value" FLOAT NOT NULL,
    PRIMARY KEY("index")
);

CREATE INDEX "idx_data_ym" ON "data"(strftime('%Y-%m', date));

CREATE INDEX "idx_data_ymd" ON "data"(date(date)); -- equivalent of strftime('%Y-%m-%d', date)

For these queries:

SELECT STRFTIME('%Y-%m', date) AS year_month, 
       SUM(value) AS total
FROM data 
GROUP BY year_month;

SELECT date(date) AS year_month_day, 
       SUM(value) AS total
FROM data 
GROUP BY year_month_day;

the correct index will be used.
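One way to confirm this from Python is sketched below. It assumes the SQLite library bundled with Python is recent enough to allow date functions in index expressions, and it uses a handful of made-up rows. `query_plan` is a hypothetical helper, not part of the sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (
    "index" INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    date TEXT NOT NULL,
    value FLOAT NOT NULL,
    PRIMARY KEY("index")
);
CREATE INDEX idx_data_ym ON data(strftime('%Y-%m', date));
CREATE INDEX idx_data_ymd ON data(date(date));
""")

# Made-up sample rows: (index, customer_id, date, value)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", [
    (1, 1, "2020-01-01 00:00:00", 1.0),
    (2, 1, "2020-01-01 01:00:00", 2.0),
    (3, 1, "2020-01-02 00:00:00", 4.0),
    (4, 2, "2020-02-01 00:00:00", 8.0),
])

MONTHLY = """SELECT strftime('%Y-%m', date) AS year_month, SUM(value) AS total
             FROM data GROUP BY year_month"""
DAILY = """SELECT date(date) AS year_month_day, SUM(value) AS total
           FROM data GROUP BY year_month_day"""

def query_plan(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a list of strings."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

monthly_plan = query_plan(conn, MONTHLY)
daily_plan = query_plan(conn, DAILY)
monthly = conn.execute(MONTHLY).fetchall()
daily = conn.execute(DAILY).fetchall()

print(monthly_plan, monthly)
print(daily_plan, daily)
```

The expression in the GROUP BY has to match the indexed expression for the planner to pick the index, which is why the queries group by the same `strftime('%Y-%m', date)` and `date(date)` calls that the indexes were created on.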


A suitable index is usually the most effective way to speed up this kind of query.
