Identifying the second-highest value in Hive based on a condition

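I have a table whose rows look like the ones below. The Rank column ranks the rows within each ticketID partition by Timestamp in descending order.

Every row has exactly one flag equal to 1.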

ticketID  |  flag 1  | flag 2 | flag 3 | flag 4 | Timestamp  |  Rank    |  stringvalue  |  
----------------------------------------------------------------------------------------|
   1      |    0     |    0   |    1   |    0   |  xxxxxx    |    2     |   aaaaaa      |
   1      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   bbbbbb      |
   1      |    0     |    1   |    0   |    0   |  xxxxxx    |    3     |   aaaaaa      |
   2      |    1     |    0   |    0   |    0   |  xxxxxx    |    2     |   bbbbbb      |
   2      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   xxxxxx      |
   3      |    0     |    0   |    1   |    0   |  xxxxxx    |    4     |   aaaaaa      |
   3      |    0     |    1   |    0   |    0   |  xxxxxx    |    3     |   bbbbbb      |
   3      |    1     |    0   |    0   |    0   |  xxxxxx    |    1     |   ssssss      |
   3      |    0     |    0   |    0   |    1   |  xxxxxx    |    2     |   nnnnnn      |
   4      |    0     |    1   |    0   |    0   |  xxxxxx    |    2     |   gggggg      |
   4      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   iiiiii      |

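For each ticketID I need the first row by rank, with one exception that depends on the flags:

When a ticket's rank 1 row is the one with flag 4 = 1, I need to promote the rank 2 row to first place. If that rank 2 row has flag 3 = 1, I also need to concatenate the stringvalue of the rank 1 row (flag 4) with the stringvalue of the rank 2 row (flag 3).

If the rank 2 row has flag 1 = 1 or flag 2 = 1, simply discard the rank 1 row and return the rank 2 row as the first.

I hope my question is clear.

Thanks

Edit

Sample output: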

----------------------------------------------------------------------------------------
ticketID  |  flag 1  | flag 2 | flag 3 | Timestamp  |  Rank    |  stringvalue          |  
---------------------------------------------------------------------------------------|
   1      |    0     |    0   |    1   |  xxxxxx    |    1     |   aaaaaa / bbbbbb     |
   2      |    1     |    0   |    0   |  xxxxxx    |    1     |        bbbbbb         |
   3      |    1     |    0   |    0   |  xxxxxx    |    1     |        ssssss         |
   4      |    0     |    1   |    0   |  xxxxxx    |    1     |        gggggg         |
----------------------------------------------------------------------------------------
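
The rules and the sample output above can be cross-checked with a small reference sketch in plain Python. It is not part of the original post and is not Hive code: the `Row` type, `pick_row`, and `pick_all` are hypothetical names, the Timestamp column is left out because the sample only shows placeholders, and the order of the joined string (rank 2 value first) is an assumption taken from the sample output. A rank 2 row that also has flag 4 set is not covered by the question, so the sketch simply keeps rank 1 in that case.

```python
from typing import Dict, Iterable, List, NamedTuple


class Row(NamedTuple):
    """Hypothetical in-memory shape of one table row (Timestamp omitted)."""
    ticket_id: int
    flag_1: int
    flag_2: int
    flag_3: int
    flag_4: int
    rank: int
    stringvalue: str


def pick_row(rows: List[Row]) -> Row:
    """Return the single output row for one ticket, re-ranked as 1."""
    ordered = sorted(rows, key=lambda r: r.rank)
    first = ordered[0]
    # Rank 1 wins unless it is the flag 4 row and a rank 2 row exists.
    if first.flag_4 != 1 or len(ordered) < 2:
        return first._replace(rank=1)
    second = ordered[1]
    if second.flag_3 == 1:
        # Assumption: rank 2 value first, then rank 1, as in the sample output.
        joined = f"{second.stringvalue} / {first.stringvalue}"
        return second._replace(rank=1, stringvalue=joined)
    if second.flag_1 == 1 or second.flag_2 == 1:
        return second._replace(rank=1)
    # Case not described in the question: keep rank 1 (assumption).
    return first._replace(rank=1)


def pick_all(rows: Iterable[Row]) -> Dict[int, Row]:
    """Group rows by ticket and apply pick_row to each group."""
    by_ticket: Dict[int, List[Row]] = {}
    for row in rows:
        by_ticket.setdefault(row.ticket_id, []).append(row)
    return {ticket: pick_row(group) for ticket, group in sorted(by_ticket.items())}


# The eleven sample rows from the question: (ticket, f1, f2, f3, f4, rank, value).
SAMPLE = [
    (1, 0, 0, 1, 0, 2, "aaaaaa"), (1, 0, 0, 0, 1, 1, "bbbbbb"),
    (1, 0, 1, 0, 0, 3, "aaaaaa"), (2, 1, 0, 0, 0, 2, "bbbbbb"),
    (2, 0, 0, 0, 1, 1, "xxxxxx"), (3, 0, 0, 1, 0, 4, "aaaaaa"),
    (3, 0, 1, 0, 0, 3, "bbbbbb"), (3, 1, 0, 0, 0, 1, "ssssss"),
    (3, 0, 0, 0, 1, 2, "nnnnnn"), (4, 0, 1, 0, 0, 2, "gggggg"),
    (4, 0, 0, 0, 1, 1, "iiiiii"),
]

RESULT = pick_all(Row(*values) for values in SAMPLE)
for ticket_id, row in RESULT.items():
    print(ticket_id, row.stringvalue)
```

Running it against the sample rows gives one row per ticket, which makes it easy to compare any SQL solution against the expected table.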

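Answer from a uj5u.com user:

I would use a few subqueries that pack each row into a struct and then group by ticketID. This lets us ask questions about several rows at once without using window functions. Since no window state has to be maintained, it may run faster.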

create table theRanks (ticketID int, flag_1 int, flag_2 int, flag_3 int, flag_4 int, Timestamp string, Rank int, stringvalue string);
-- create some dummy data (the three rows of ticket 1 from the question)
insert into theRanks values ( 1 , 0, 0, 1, 0, 'xxxxxx', 2, 'aaaaaa');
insert into theRanks values ( 1 , 0, 0, 0, 1, 'xxxxxx', 1, 'bbbbbb');
insert into theRanks values ( 1 , 0, 1, 0, 0, 'xxxxxx', 3, 'aaaaaa');

with stuct_table as -- first CTE (with-clause subquery)
( 
  select 
   ticketID, 
   struct( -- the struct lets us carry a whole row through the group by
    Rank as rawRank, -- must be the first field of the struct, because the sort below orders on it
    flag_1 , 
    flag_2, 
    flag_3, 
    flag_4 , 
    Timestamp , 
    stringvalue 
   ) as myRow 
 from 
  theRanks 
 where 
  rank in (1,2) -- only look at first two ranks
), 
constants as -- subquery
( 
 select 0 as rank1, 1 as rank2 -- strictly not needed just to help make it more readable 
), 
grouped_rows as --subquery
(
 select 
  ticketID, 
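  array_sort(collect_list(myRow)) as row_list  -- collects each ticket's structs into a list sorted by rawRank, the first struct field (if array_sort is not available in your Hive version, sort_array is the long-standing built-in)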
 from stuct_table 
 group by ticketID
) , 
raw_rows as (select -- final CTE: pick the row to return for each ticket
 ticketId, 
 case 
  when -- rank 1 is the flag 4 row and rank 2 is a flag 1 or flag 2 row: return rank 2
   row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_1 + row_list[constants.rank2].flag_2 > 0
 then
   row_list[constants.rank2]
 when 
   row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3  = 1 -- condition to concat string
 then
   struct( -- this struct must match the original one we created
    row_list[constants.rank2].rawRank as rawRank, 
    row_list[constants.rank2].flag_1 as flag_1,
    row_list[constants.rank2].flag_2 as flag_2,
    row_list[constants.rank2].flag_3 as flag_3,
    row_list[constants.rank2].flag_4 as flag_4,
    row_list[constants.rank2].Timestamp as Timestamp,
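    concat( -- joins the rank 1 value, then the rank 2 value; swap the two arguments for the order shown in the sample output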
      row_list[constants.rank1].stringvalue, 
      ' / ', 
      row_list[constants.rank2].stringvalue) as stringvalue
    )
 else
   row_list[constants.rank1]
 end as rankedRow,
 1 as Rank
from grouped_rows
cross join constants -- not strictly needed: constants.rank1 could be replaced with 0 and constants.rank2 with 1 throughout; the names only make the array indexes easier to read
)
select ticketID, rankedRow.* , 1 as Rank from raw_rows; -- expands the struct fields back into table columns
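
The reason the rank has to be the first struct field can be illustrated outside Hive. The sketch below is plain Python, not part of the original answer, and it assumes (as the answer's comment implies) that the engine compares structs field by field, the same way Python compares tuples. The variable names are hypothetical, and only the ticketID, Rank, and stringvalue columns are modelled.

```python
from collections import defaultdict

# (ticket_id, rank, stringvalue): a cut-down version of the question's rows.
rows = [
    (1, 2, "aaaaaa"),
    (1, 1, "bbbbbb"),
    (1, 3, "aaaaaa"),
    (4, 2, "gggggg"),
    (4, 1, "iiiiii"),
]

# Equivalent of: where rank in (1,2) ... group by ticketID with collect_list.
grouped = defaultdict(list)
for ticket_id, rank, value in rows:
    if rank in (1, 2):
        grouped[ticket_id].append((rank, value))  # rank is the FIRST field

# Equivalent of the sort: tuples compare field by field, so rank decides the order.
row_lists = {ticket: sorted(structs) for ticket, structs in grouped.items()}

# Index 0 is now the rank 1 row and index 1 the rank 2 row for every ticket.
print(row_lists[1])

# With the string first, the order would follow the text instead of the rank.
wrong_order = sorted((value, rank) for rank, value in grouped[1])
print(wrong_order[0])
```

With the rank first, index 0 and index 1 of each list line up with the `constants.rank1` and `constants.rank2` indexes used in the query. With another field first, the list is ordered by that field, and the indexes no longer correspond to ranks.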

Please credit the source when reposting. Link to this article: https://www.uj5u.com/ruanti/481784.html

Tags: sql, hadoop, hive, hiveql
