Selecting one row in SQL (Snowflake) based on a string

I have a stubborn problem that I have been racking my brain over.

Say I have a table that looks like this:

ID      Group                 Timestamp   Data
001         A   2021-04-13 12:51:12.063   content121
001  A-Direct   2021-04-13 12:52:13.063   content121
002  A-Direct   2021-04-13 12:50:14.063   content133
003  B-Direct   2021-04-13 12:55:12.063   content132
003         B   2021-04-13 12:56:11.063   content142
003        BA   2021-04-13 12:57:22.063   content153
004         D   2021-04-13 12:10:23.063   content113
004         C   2021-04-13 12:11:43.063   content144
005         C   2021-04-13 12:12:12.063   content111
005         A   2021-04-13 12:13:23.063   content100
005  D-Direct   2021-04-13 12:15:23.063   content121
006         A   2021-04-13 12:51:12.063   content121
006  B-Direct   2021-04-13 12:52:13.063   content121
007  A-Direct   2021-04-13 12:51:12.063   content121
007         A   2021-04-13 12:52:13.063   content121
008  B-Direct   2021-04-13 12:55:12.063   content132
008         B   2021-04-13 12:56:11.063   content142
008  B-Direct   2021-04-13 12:57:22.063   content153
009  B-Direct   2021-04-13 12:55:12.063   content132
009  C-Direct   2021-04-13 12:56:11.063   content142
009  D-Direct   2021-04-13 12:57:22.063   content153

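I need a result table in which each row holds one distinct ID. The criteria for choosing which row represents an ID are a bit complicated, though.

The default choice is the most recent entry, determined by TIMESTAMP.

The complication comes from any group that contains -Direct. Specifically, if an ID has multiple entries and one is (for example) A while another is A-Direct, we need A. This only applies when the letters match. As the case ID = 006 shows, we want B-Direct there because its counterpart is A, not B.

So the core logic I am looking for is:

If the rows for an ID start with the same string and one of them ends with -Direct, replace it with the version that has -Direct removed.

Final output: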

ID      Group
001         A
002  A-Direct
003        BA
004         C
005  D-Direct
006  B-Direct
007         A
008         B
009  D-Direct

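For extra clarity, here is a rundown of what happens for each ID:

ID 001: A is followed by A-Direct, so we resolve A-Direct to A.
ID 002: A-Direct is the only result, easy!
ID 003: BA differs from B and B-Direct, so we stick with the most recent, BA.
ID 004: There are no Direct entries, so we just take the most recent, C.
ID 005: D-Direct is the most recent, and because there is no D, we stick with D-Direct.
ID 006: B-Direct is the most recent, and because there is no B, we stick with B-Direct.
ID 007: A-Direct is followed by A, so we just take the most recent one, no problem.
ID 008: B and B-Direct (x2) appear here, therefore we can use B.
ID 009: All options are Direct, so we go with the most recent, D-Direct.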
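The rule described above can be written down as a small reference implementation and checked against the expected output. This is only a sketch in Python, not Snowflake SQL: the function name `pick_groups` and the tuple layout are made up for illustration, and it assumes timestamps are fixed-width strings (as in the sample), so comparing them as text matches chronological order.

```python
# Sample rows copied from the question: (ID, Group, Timestamp).
ROWS = [
    ("001", "A",        "2021-04-13 12:51:12.063"),
    ("001", "A-Direct", "2021-04-13 12:52:13.063"),
    ("002", "A-Direct", "2021-04-13 12:50:14.063"),
    ("003", "B-Direct", "2021-04-13 12:55:12.063"),
    ("003", "B",        "2021-04-13 12:56:11.063"),
    ("003", "BA",       "2021-04-13 12:57:22.063"),
    ("004", "D",        "2021-04-13 12:10:23.063"),
    ("004", "C",        "2021-04-13 12:11:43.063"),
    ("005", "C",        "2021-04-13 12:12:12.063"),
    ("005", "A",        "2021-04-13 12:13:23.063"),
    ("005", "D-Direct", "2021-04-13 12:15:23.063"),
    ("006", "A",        "2021-04-13 12:51:12.063"),
    ("006", "B-Direct", "2021-04-13 12:52:13.063"),
    ("007", "A-Direct", "2021-04-13 12:51:12.063"),
    ("007", "A",        "2021-04-13 12:52:13.063"),
    ("008", "B-Direct", "2021-04-13 12:55:12.063"),
    ("008", "B",        "2021-04-13 12:56:11.063"),
    ("008", "B-Direct", "2021-04-13 12:57:22.063"),
    ("009", "B-Direct", "2021-04-13 12:55:12.063"),
    ("009", "C-Direct", "2021-04-13 12:56:11.063"),
    ("009", "D-Direct", "2021-04-13 12:57:22.063"),
]

SUFFIX = "-Direct"


def pick_groups(rows):
    """Return {id: group} for the selection rule (hypothetical helper)."""
    by_id = {}
    for id_, group, ts in rows:
        by_id.setdefault(id_, []).append((ts, group))

    result = {}
    for id_, entries in by_id.items():
        # Default: the most recent entry by timestamp.
        _, chosen = max(entries, key=lambda entry: entry[0])
        groups_seen = {group for _, group in entries}
        # Override: drop "-Direct" only if the plain counterpart exists for this ID.
        if chosen.endswith(SUFFIX):
            base = chosen[: -len(SUFFIX)]
            if base in groups_seen:
                chosen = base
        result[id_] = chosen
    return result
```

Calling `pick_groups(ROWS)` on the sample data gives the same ID-to-group mapping as the final output table above, which makes it a handy cross-check for any SQL solution.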

I can figure out how to get the most recent row, but I am unsure how to adjust the query for the criteria above:

WITH data AS (
    select d.*,
        rank() over (
            partition by ID
            order by TIMESTAMP DESC
        ) as num
    FROM table d
)
select ID, TIMESTAMP
    from data
    where num = 1

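A uj5u.com user replied:

I would probably start with something like the following. It is not super pretty, so there may be a better solution, but I think it does what you need.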

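WITH data AS (
    SELECT d.*,
        RANK() OVER (
            PARTITION BY ID
            ORDER BY TIMESTAMP DESC
        ) AS num
    FROM table d  -- "table" is a placeholder for the real table name
)
SELECT ID,
    CASE
        -- Does this ID also have a row whose group is this one minus '-Direct'?
        WHEN EXISTS (SELECT * FROM table t WHERE t.id = d.id AND t.group || '-Direct' = d.group)
            THEN REPLACE(d.group, '-Direct')  -- strip the suffix
        ELSE d.group
    END AS grp  -- aliased as grp because GROUP is a reserved word
FROM data d
WHERE num = 1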

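This gets the latest row for each ID (using your current code). The CASE/EXISTS expression in the SELECT clause then checks whether the same ID has a matching entry without "-Direct", and if it does, we remove "-Direct" from the string.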
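To try this CASE/EXISTS approach without a Snowflake account, the query can be adapted to SQLite and run from Python. This is a rough sketch under several assumptions: the table is called `tab`, the columns are renamed to `id`, `grp`, and `ts` to avoid reserved words, only a few sample rows are loaded, and `replace` takes an explicit third argument because SQLite requires it. It needs a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# A subset of the question's rows that covers the interesting cases.
ROWS = [
    ("001", "A",        "2021-04-13 12:51:12.063"),
    ("001", "A-Direct", "2021-04-13 12:52:13.063"),
    ("003", "B-Direct", "2021-04-13 12:55:12.063"),
    ("003", "B",        "2021-04-13 12:56:11.063"),
    ("003", "BA",       "2021-04-13 12:57:22.063"),
    ("006", "A",        "2021-04-13 12:51:12.063"),
    ("006", "B-Direct", "2021-04-13 12:52:13.063"),
    ("008", "B-Direct", "2021-04-13 12:55:12.063"),
    ("008", "B",        "2021-04-13 12:56:11.063"),
    ("008", "B-Direct", "2021-04-13 12:57:22.063"),
]

# SQLite adaptation of the answer's query (names are assumptions, see above).
QUERY = """
WITH data AS (
    SELECT d.*,
           RANK() OVER (PARTITION BY id ORDER BY ts DESC) AS num
    FROM tab d
)
SELECT d.id,
       CASE
           WHEN EXISTS (SELECT 1 FROM tab t
                        WHERE t.id = d.id AND t.grp || '-Direct' = d.grp)
               THEN replace(d.grp, '-Direct', '')
           ELSE d.grp
       END AS grp
FROM data d
WHERE d.num = 1
ORDER BY d.id
"""


def run_exists_query(rows):
    """Load rows into an in-memory SQLite table and run the adapted query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tab (id TEXT, grp TEXT, ts TEXT)")
    conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result
```

Because the comparison is `t.grp || '-Direct' = d.grp`, only an exact plain counterpart triggers the override: for ID 003 the latest row is BA, and neither B nor B-Direct turns into BA when the suffix is appended, so BA is kept.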

A uj5u.com user replied:

Using:

SELECT ID
   ,CASE WHEN MIN(group) OVER(PARTITION BY ID, REPLACE(group, '-Direct'))
             = MAX(group) OVER(PARTITION BY ID, REPLACE(group, '-Direct'))
         THEN group
         ELSE REPLACE(group, '-Direct')
    END AS grp
FROM tab
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY TIMESTAMP DESC) = 1;

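QUALIFY makes sure that only the latest row per ID (by TIMESTAMP) is kept, and the CASE expression handles the "-Direct" override: within each ID, rows are partitioned by the group name with "-Direct" removed, and if MIN and MAX differ in that partition, both the plain and the "-Direct" variant exist, so the suffix is stripped.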
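The MIN/MAX comparison can be traced step by step outside the database. The following Python sketch mimics the two window partitions on a few of the sample rows. The helper names `strip_suffix` and `resolve` are hypothetical, and the code illustrates the logic only; it does not show how Snowflake executes the query.

```python
from collections import defaultdict

SUFFIX = "-Direct"

# A subset of the question's rows: (ID, Group, Timestamp).
ROWS = [
    ("001", "A",        "2021-04-13 12:51:12.063"),
    ("001", "A-Direct", "2021-04-13 12:52:13.063"),
    ("003", "B-Direct", "2021-04-13 12:55:12.063"),
    ("003", "B",        "2021-04-13 12:56:11.063"),
    ("003", "BA",       "2021-04-13 12:57:22.063"),
    ("006", "A",        "2021-04-13 12:51:12.063"),
    ("006", "B-Direct", "2021-04-13 12:52:13.063"),
    ("008", "B-Direct", "2021-04-13 12:55:12.063"),
    ("008", "B",        "2021-04-13 12:56:11.063"),
    ("008", "B-Direct", "2021-04-13 12:57:22.063"),
]


def strip_suffix(group):
    """Mimic REPLACE(group, '-Direct'): remove the suffix text."""
    return group.replace(SUFFIX, "")


def resolve(rows):
    # PARTITION BY ID, REPLACE(group, '-Direct'): group values per partition.
    partitions = defaultdict(set)
    # PARTITION BY ID ORDER BY TIMESTAMP DESC: newest timestamp per ID.
    newest = {}
    for id_, group, ts in rows:
        partitions[(id_, strip_suffix(group))].add(group)
        newest[id_] = max(newest.get(id_, ts), ts)

    out = []
    for id_, group, ts in rows:
        if ts != newest[id_]:
            continue  # like QUALIFY RANK() = 1: keep only the newest row(s)
        values = partitions[(id_, strip_suffix(group))]
        # MIN = MAX means the partition holds a single distinct group value.
        grp = group if min(values) == max(values) else strip_suffix(group)
        out.append((id_, grp))
    return sorted(out)
```

Note that, like RANK() in the query, this keeps every row tied for the newest timestamp, so an ID with duplicate latest timestamps would appear more than once.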

Tags: sql string snowflake-cloud-data-platform window-functions partition
