在SQLite資料庫中插入數百萬行，Python太慢了-有解無憂

對于我的國際象棋引擎，我使用統計資料來選擇最佳移動。我從數百萬個游戲中收集了它們。我對當前的移動、下一步以及在當前移動的情況下下一個移動的次數感興趣。

對于使用 Python 字典并將其與 pickle 一起存盤，檔案太大，并且很難用新游戲進行更新。所以我決定使用 SQLite。

我創建了一個類MovesDatabase：

class MovesDatabase:

def __init__(self, work_dir):
    self.con = sqlite3.connect(os.path.join(work_dir, "moves.db"))
    self.con.execute('PRAGMA temp_store = MEMORY')
    self.con.execute('PRAGMA synchronous = NORMAL')
    self.con.execute('PRAGMA journal_mode = WAL')
    self.cur = self.con.cursor()

    self.cur.execute("CREATE TABLE IF NOT EXISTS moves("
                     "move TEXT,"
                     "next TEXT,"
                     "count INTEGER DEFAULT 1);")

move并next以字串格式表示棋盤的狀態：FEN。例子：

rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
r1b1k1nr/p2p1pNp/n2B4/1p1NP2P/6P1/3P1Q2/P1P1K3/q5b1
8/8/8/4p1K1/2k1P3/8/8/8 b

下面的方法負責獲取一個游戲檔案，如果這對 ( move, next) 是新的，則提取移動并插入，或者如果 ( move, next) 已經存在于資料庫中，則更新：

def insert_moves_from_file(self, file: str):
    print("Extracting moves to database from "   file)

    count = 0

    with open(file) as games_file:
        game = chess.pgn.read_game(games_file)

        while game is not None:
            batch = []
            board = game.board()
            state_one = board.fen().split(' ')[0]   ' '   board.fen().split(' ')[1]

            for move in game.mainline_moves():
                board.push(move)
                fen = board.fen().split(' ')
                state_two = fen[0]   ' '   fen[1]

                res = self.cur.execute("SELECT * FROM moves WHERE move=? AND next=?",
                                       (state_one, state_two))
                res = res.fetchall()

                if len(res) != 0:
                    self.cur.execute("UPDATE moves SET count=count 1 WHERE move=? AND next=?",
                                     (state_one, state_two))
                else:
                    batch.append((state_one, state_two))

                state_one = state_two

            self.cur.executemany("INSERT INTO moves(move, next) VALUES"
                                 "(?, ?)", batch)
            count  = 1
            print('\r' "%d games was add to the database.." % (count   1), end='')
            game = chess.pgn.read_game(games_file)

    self.con.commit()
    print("\n Finished!")

這對夫婦 ( move, next) 是獨一無二的。

我用一個包含大約 400 萬個 ( move, next) 的檔案進行了測驗。它開始以 3.000 行/秒的速度插入/更新，但在 50K 行時它會減慢到 100 行/秒并繼續下降。我設計這種方法來處理多個游戲檔案，這就是為什么我首先選擇 SQL 資料庫。

uj5u.com熱心網友回復：

這里不是INSERT很慢。

您的move和next列沒有被索引，因此任何SELECT或UPDATE涉及這些列都需要全表掃描。

如果(move, next)始終是唯一的，您將需要在其UNIQUE上添加索引。它還將自動使查詢move/next對的查詢更快（但不一定是那些僅查詢這兩列之一的查詢）。

要在現有表上創建該索引，

CREATE UNIQUE INDEX ix_move_next ON moves (move, next);

最后，一旦你有了那個索引，你也可以用 upsert 擺脫整個SELECT/UPDATE東西：

INSERT INTO moves (move, next) VALUES (?, ?) ON CONFLICT (move, next) DO UPDATE SET count = count   1;

這是一個輕微的重構，在我的機器上插入了大約 6200 次移動/秒。（它需要tqdm一個漂亮的進度條庫和一個pgns/包含 PGN 檔案的目錄。）

import glob
import sqlite3
import chess.pgn
import tqdm
from chess import WHITE


def board_to_state(board):
    # These were extracted from the implementation of `board.fen()`
    # so as to avoid doing extra work we don't need.
    bfen = board.board_fen(promoted=False)
    turn = ("w" if board.turn == WHITE else "b")
    return f'{bfen} {turn}'


def insert_game(cur, game):
    batch = []
    board = game.board()
    state_one = board_to_state(board)
    for move in game.mainline_moves():
        board.push(move)
        state_two = board_to_state(board)
        batch.append((state_one, state_two))
        state_one = state_two
    cur.executemany("INSERT INTO moves (move, next) VALUES (?, ?) ON CONFLICT (move, next) DO UPDATE SET count = count   1", batch)
    n_moves = len(batch)
    return n_moves


def main():
    con = sqlite3.connect("moves.db")
    con.execute('PRAGMA temp_store = MEMORY')
    con.execute('PRAGMA synchronous = NORMAL')
    con.execute('PRAGMA journal_mode = WAL')
    con.execute('CREATE TABLE IF NOT EXISTS moves(move TEXT,next TEXT,count INTEGER DEFAULT 1);')
    con.execute('CREATE UNIQUE INDEX IF NOT EXISTS ix_move_next ON moves (move, next);')

    cur = con.cursor()

    for pgn_file in sorted(glob.glob("pgns/*.pgn")):
        with open(pgn_file) as games_file:
            n_games = 0
            with tqdm.tqdm(desc=pgn_file, unit="moves") as pbar:
                while (game := chess.pgn.read_game(games_file)):
                    n_moves = insert_game(cur, game)
                    n_games  = 1
                    pbar.set_description(f"{pgn_file} ({n_games} games)", refresh=False)
                    pbar.update(n_moves)
            con.commit()


if __name__ == '__main__':
    main()

轉載請註明出處，本文鏈接：https://www.uj5u.com/qukuanlian/510188.html

標籤：Pythonsqlite棋

上一篇：使用EF插入SqlLite資料庫總是回傳1的Id

下一篇：如何查看臨時表？