前言
流式計算可能在日常不多見,主要統計一個階段內的PV、UV,在風控場景很常見,比如統計某個用戶一天內同地區下單總量來判斷該用戶是否為例外用戶,還有一些大資料處理場景,如將某一段時間生成的日志按需要加工后倒入到存盤DB中做查詢報表,為什么要學習Flink,因為最近碰到一些實時計算性能問題,其次也不太理解實時計算底層實作原理,這里拿當下很流行的開源工具Flink作為待學習物件,一步一步深入Flink底層探索實時計算奧秘,
第一個程式
導maven依賴,主要依賴項如下:
<properties>
<blink.version>1.5.1</blink.version>
<scala.binary.version>2.11</scala.binary.version>
<blink-streaming.version>1.5.1</blink-streaming.version>
<log4j.version>1.2.17</log4j.version>
<slf4j-log4j.version>1.7.9</slf4j-log4j.version>
</properties>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.alibaba.blink</groupId>
<artifactId>flink-core</artifactId>
<version>${blink.version}</version>
</dependency>
<dependency>
<groupId>com.alibaba.blink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>${blink.version}</version>
</dependency>
<!-- blink stream java -->
<dependency>
<groupId>com.alibaba.blink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${blink-streaming.version}</version>
</dependency>
<dependency>
<groupId>com.alibaba.blink</groupId>
<artifactId>flink-test-utils-junit</artifactId>
<version>${blink.version}</version>
<scope>test</scope>
</dependency>
<!-- logging framework -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${log4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j-log4j.version}</version>
</dependency>
</dependencies>
</dependencyManagement>
這里引入比較干凈,只包含flink相關核心包+日志包,接下來開始使用flink API完成第一個Hello World程式,這里我用的是flink官方WordCount Demo,代碼如下:
package com.alibaba.security.blink;
import com.alibaba.security.blink.util.WordCountData;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.util.Collector;
public class WordCount {
private final ParameterTool params;
private final ExecutionEnvironment env;
public WordCount(String[] args) {
this.params = ParameterTool.fromArgs(args);
this.env = ExecutionEnvironment.createLocalEnvironment();
env.getConfig().setGlobalJobParameters(params);
}
public static void main(String[] args) throws Exception {
WordCount wordCount = new WordCount(args);
DataSet<String> dataSet = wordCount.getDataSetFromCommandLine();
wordCount.executeFrom(dataSet);
}
private DataSet<String> getDataSetFromCommandLine() {
DataSet<String> text;
if (params.has("input")) {
text = env.readTextFile(params.get("input"));
} else {
System.out.println("Executing WordCount example with default input data set.");
System.out.println("Use --input to specify file input.");
text = WordCountData.getDefaultTextLineDataSet(env);
}
return text;
}
private void executeFrom(DataSet<String> text) throws Exception {
DataSet<Tuple2<String, Integer>> counts =
text.flatMap(new Tokenizer())
.groupBy(0)
.sum(1);
if (this.params.has("output")) {
counts.writeAsCsv(this.params.get("output"), "\n", " ");
env.execute("WordCount Example");
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
counts.print();
}
}
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<String, Integer>(token, 1));
}
}
}
}
}
WordCountData.java代碼如下:
package com.alibaba.security.blink.util;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
public class WordCountData {
public static final String[] WORDS = new String[] {
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,",
"And by opposing end them?--To die,--to sleep,--",
"No more; and by a sleep to say we end",
"The heartache, and the thousand natural shocks",
"That flesh is heir to,--'tis a consummation",
"Devoutly to be wish'd. To die,--to sleep;--",
"To sleep! perchance to dream:--ay, there's the rub;",
"For in that sleep of death what dreams may come,",
"When we have shuffled off this mortal coil,",
"Must give us pause: there's the respect",
"That makes calamity of so long life;",
"For who would bear the whips and scorns of time,",
"The oppressor's wrong, the proud man's contumely,",
"The pangs of despis'd love, the law's delay,",
"The insolence of office, and the spurns",
"That patient merit of the unworthy takes,",
"When he himself might his quietus make",
"With a bare bodkin? who would these fardels bear,",
"To grunt and sweat under a weary life,",
"But that the dread of something after death,--",
"The undiscover'd country, from whose bourn",
"No traveller returns,--puzzles the will,",
"And makes us rather bear those ills we have",
"Than fly to others that we know not of?",
"Thus conscience does make cowards of us all;",
"And thus the native hue of resolution",
"Is sicklied o'er with the pale cast of thought;",
"And enterprises of great pith and moment,",
"With this regard, their currents turn awry,",
"And lose the name of action.--Soft you now!",
"The fair Ophelia!--Nymph, in thy orisons",
"Be all my sins remember'd."
};
public static DataSet<String> getDefaultTextLineDataSet(ExecutionEnvironment env) {
return env.fromElements(WORDS);
}
}
運行時如果加了命令列引數--input則從自定義輸入檔案中讀取內容,否則從WordCountData中讀取,
Flink代碼啟動通過org.apache.flink.api.java.ExecutionEnvironment#createLocalEnvironment()來完成,表示flink本地啟動,flink只能處理DataSet,因此任何資料想要在flink里處理,都要被轉換成DataSet,這里將文本轉化為DataSet通過調org.apache.flink.api.java.ExecutionEnvironment#readTextFile方法,下面executeFrom方法就是flink核心處理流程了,先將一行行文本打散轉化為Tuple2物件,Tuple2就是一個KV,然后對打散后的Tuple2集合進行groupBy,相同單詞將被groupBy一起,最后將所有相同單詞相加(sum),最終得到每個單詞出現次數,
API呼叫
flatMap
flatMap在java里用的也不多,主要用的還是map,這里我用jdk1.8 Stream API寫了一個flatMap demo
import org.apache.commons.lang3.tuple.Pair;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
class Scratch {
public static void main(String[] args) {
String str = "abc,debf";
String[] strArray = str.split(",");
// 使用flatMap回傳多個元素(Stream)
List<Object> result = Arrays.stream(strArray).flatMap(new Function<String, Stream<?>>() {
@Override
public Stream<?> apply(String s) {
return Stream.of(s.split("b"));
}
}).collect(Collectors.toList());
System.out.println(result);
// 使用map方式只能回傳一個元素
List<Pair<String, Integer>> tupleResultList = Arrays.stream(strArray).map(new Function<String, Pair<String, Integer>>() {
@Override
public Pair<String, Integer> apply(String s) {
return Pair.of(s, s.length());
}
}).collect(Collectors.toList());
System.out.println(tupleResultList);
}
}
先將字串str拆分成一個陣列,然后遍歷陣列對資料中每個字串再進行切割,將切割后生成的字串陣列重新構建為一個Stream物件并回傳,也就是說flatMap做了一個一變多的事,一個流變成多個流了:

將Stream1中每個元素都遍歷一遍,然后將遍歷的每個元素又轉化成一個Stream物件,最終生成的就是一個Stream集合,
如果用map只能回傳一個元素

flink中flatMap和map也是一樣的道理,上面flink例子里用的是flatMap,將每行記錄轉化后的單詞都保存到Collector里,后面該Collector可以作為輸入執行groupBy操作,而如果是換成map該怎么寫呢?代碼如下
// 這里map每次方法呼叫只會回傳一個Tuple2物件
AggregateOperator<Tuple2<String, Integer>> firstSumResult = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
String[] tokens = value.toLowerCase().split("\\W+");
if (tokens.length > 0) {
return new Tuple2<String, Integer>(tokens[0], 1);
}
return null;
}
}).groupBy(0).sum(1);
List<Tuple2<String, Integer>> result = firstSumResult.collect();
result.forEach(e -> LOGGER.info("word={},count={}", e.f0, e.f1));
可以看出MapFunction#map是有回傳值的,且回傳值為單元素,后面groupBy都是針對map后生成的集合來操作,
因此如何選擇map與flatMap我個人認為:如果只是便利DataSet將一個物件轉化成另一個物件可以使用map函式,如果是一個物件轉化成多個物件,可以使用flatMap,
groupBy
groupBy也是DataSet提供的標準API之一,該方法有3個多載的方法,如下

groupBy(int... fields)
該方法只能對Tuple型別DataSet起作用,Tuple有哪些類呢?

使用該種groupBy方法舉個例子:
/**
* @author shaoxian.ssx
* @date 2021/11/7
*/
public class SimpleGroupBy {
private final static Logger LOGGER = LoggerFactory.getLogger(SimpleGroupBy.class);
public static void main(String[] args) throws Exception {
LocalEnvironment env = ExecutionEnvironment.createLocalEnvironment();
List<Tuple2<String, Integer>> list = new ArrayList<>(16);
Random random = new Random();
for (int i = 0; i < 16; i++) {
int num = random.nextInt(20);
list.add(new Tuple2<String, Integer>(String.valueOf(num), num));
}
LOGGER.info("listResult={}", list);
List<Tuple2<String, Integer>> flinkRes = env.fromCollection(list).groupBy(0).sum(1).collect();
LOGGER.info("flinkResult={}", flinkRes);
}
}
輸出如下:
16:41:33,708 [ main] INFO com.alibaba.security.blink.SimpleGroupBy - listResult=[(1,1), (9,9), (14,14), (18,18), (1,1), (7,7), (16,16), (1,1), (4,4), (16,16), (15,15), (17,17), (16,16), (0,0), (12,12), (15,15)]
16:41:37,573 [ main] INFO com.alibaba.security.blink.SimpleGroupBy - flinkResult=[(12,12), (18,18), (9,9), (14,14), (15,30), (7,7), (16,48), (17,17), (4,4), (0,0), (1,3)]
通過上面groupBy例子可以看出,groupBy(int... fields) 方法僅針對DataSet型別為Tuple系列的資料源才有效,fields順序為Tuple中屬性位置,如Tuple第0號屬性,則引數為0,以此類推,
groupBy(String... fields)
該方法可以針對那些DataSet為POJO型別資料源,方法引數為POJO屬性且該屬性必須有公共的setter、getter方法,并且該POJO必須有一個默認無引數構造方法,舉個例子,獲取某個用戶所有下單IP個數,代碼如下:
/**
* @author shaoxian.ssx
* @date 2021/11/7
*/
public class SimpleGroupBy {
private final static Logger LOGGER = LoggerFactory.getLogger(SimpleGroupBy.class);
public static void main(String[] args) throws Exception {
LocalEnvironment env = ExecutionEnvironment.createLocalEnvironment();
Order order1 = new Order(1001L, "張三", "192.168.1.10");
Order order2 = new Order(1002L, "李四", "192.168.1.212");
Order order3 = new Order(1001L, "張三", "192.168.1.50");
Order order4 = new Order(1003L, "王五", "192.168.1.71");
DataSource<Order> dataSource = env.fromElements(order1, order2, order3, order4);
List<Order> result = dataSource.groupBy("byrId").reduce(new ReduceFunction<Order>() {
@Override
public Order reduce(Order value1, Order value2) throws Exception {
if (!value1.getIp().equals(value2.getIp())) {
value1.setIpCount(value1.getIpCount() + value2.getIpCount());
return value1;
}
return value1;
}
}).collect();
result.forEach(e -> LOGGER.info("order={}", e));
}
/**
* 必須為public型別,否則flink校驗型別會報錯
*/
public static class Order {
private Long byrId;
private String name;
private String ip;
private int ipCount = 1;
// 必須提供無引數構造方法
public Order() {
}
public Order(Long byrId, String name, String ip) {
this.byrId = byrId;
this.name = name;
this.ip = ip;
}
// 省略setter、getter方法...
@Override
public String toString() {
return new StringJoiner(", ", Order.class.getSimpleName() + "[", "]")
.add("byrId=" + byrId)
.add("name='" + name + "'")
.add("ip='" + ip + "'")
.add("ipCount=" + ipCount)
.toString();
}
}
}
輸出結果如下:
17:08:02,922 [ main] INFO com.alibaba.security.blink.SimpleGroupBy - order=Order[byrId=1001, name='張三', ip='192.168.1.10', ipCount=2]
17:08:02,922 [ main] INFO com.alibaba.security.blink.SimpleGroupBy - order=Order[byrId=1003, name='王五', ip='192.168.1.71', ipCount=1]
17:08:02,922 [ main] INFO com.alibaba.security.blink.SimpleGroupBy - order=Order[byrId=1002, name='李四', ip='192.168.1.212', ipCount=1]
groupBy(KeySelector<T, K> keyExtractor)
該方法個人感覺跟上看屬性groupBy差不多,只是寫起來更好看點,也是針對POJO型別資料源,其實準確說是有public型別setter、getter方法屬性,例如Tuple2中f0、f1也可以用,還是用上面例子改用KeySelector如下:
// 告訴flink使用Order物件byrId值進行groupBy
List<Order> result = dataSource.groupBy(new KeySelector<Order, Long>() {
@Override
public Long getKey(Order value) throws Exception {
return value.getByrId();
}
}).reduce(new ReduceFunction<Order>() {
@Override
public Order reduce(Order value1, Order value2) throws Exception {
if (!value1.getIp().equals(value2.getIp())) {
value1.setIpCount(value1.getIpCount() + value2.getIpCount());
return value1;
}
return value1;
}
}).collect();
result.forEach(e -> LOGGER.info("order={}", e));
UnsortedGrouping
UnsortedGrouping為groupBy回傳物件,為什么要說groupBy呢?因為在流式計算中groupBy是最常見的場景,如groupBy商品ID來判斷哪個商品買的最多;groupBy地址來判斷哪個地方地址聚集度等等,一般sql寫完了group by后通常都要進行count,那flink在flink中怎么做呢?flink最終聚合計算調的方法都在這個UnsortedGrouping類中,count在這里為reduce操作,reduce計算邏輯封裝在ReduceFunction中,如上面統計所有訂單相同買家IP個數,在reduce中針對不同IP做了+1操作,在reduce執行完后,拿到的那個Order物件里ipCount就是最終累加后的總IP個數,當然這個UnsortedGrouping里還有很多有用方法,如maxBy、minBy、sum,這里寫個demo演示一下:
/**
* @author shaoxian.ssx
* @date 2021/11/7
*/
public class SimpleGroupBy {
private final static Logger LOGGER = LoggerFactory.getLogger(SimpleGroupBy.class);
public static void main(String[] args) throws Exception {
LocalEnvironment env = ExecutionEnvironment.createLocalEnvironment();
Order order1 = new Order(1001L, "張三", "192.168.1.10", 30d);
Order order2 = new Order(1002L, "李四", "192.168.1.212", 27d);
Order order3 = new Order(1001L, "張三", "192.168.1.50", 100d);
Order order4 = new Order(1003L, "王五", "192.168.1.71", 30d);
DataSource<Order> dataSource = env.fromElements(order1, order2, order3, order4);
// 先使用map將Order轉化為Tuple型別,然后再按照買家ID進行groupBy,最后篩選出每組中金額最大的一筆訂單并輸出
List<Tuple2<Double, Order>> result = dataSource.map(new MapFunction<Order, Tuple2<Double, Order>>() {
@Override
public Tuple2<Double, Order> map(Order value) throws Exception {
return new Tuple2<>(value.getTotal(), value);
}
}).groupBy("f1.byrId").maxBy(0).collect();
result.forEach(e -> LOGGER.info("order={}", e));
}
/**
* 必須為public型別,否則flink校驗型別會報錯
*/
public static class Order {
private Long byrId;
private String name;
private String ip;
// 訂單總金額
private double total;
private int ipCount = 1;
// 必須提供無引數構造方法
public Order() {
}
public Order(Long byrId, String name, String ip, double total) {
this.byrId = byrId;
this.name = name;
this.ip = ip;
this.total = total;
}
// 省略setter、getter方法...
public void setTotal(double total) {
this.total = total;
}
@Override
public String toString() {
return new StringJoiner(", ", Order.class.getSimpleName() + "[", "]")
.add("byrId=" + byrId)
.add("name='" + name + "'")
.add("ip='" + ip + "'")
.add("ipCount=" + ipCount)
.add("double=" + total)
.toString();
}
}
}
總結
今天學習的這些Demo及API也已入門flink,后續需要持續投入并帶來更多API呼叫探索及flink底層原理決議,
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/352153.html
標籤:其他
