Java 类名:com.alibaba.alink.operator.batch.statistics.RankingListBatchOp
Python 类名:RankingListBatchOp
排行榜是用来计算分组榜单的,例如数据是
选择marital(婚姻状况)作为分组列, age(年龄)作为主体列, balance(净资产)作为计算列,计算指标是sum,<br />
那么结果
marital | age | sum(balance) | rank |
---|---|---|---|
divorced | 54 | 318166.0 | 1 |
divorced | 56 | 283257.0 | 2 |
divorced | 59 | 281327.0 | 3 |
divorced | 58 | 263003.0 | 4 |
married | 37 | 1389347.0 | 1 |
married | 45 | 1372091.0 | 2 |
married | 39 | 1301587.0 | 3 |
married | 36 | 1274481.0 | 4 |
single | 32 | 1365481.0 | 1 |
single | 31 | 1348609.0 | 2 |
single | 30 | 1287184.0 | 3 |
single | 33 | 1044254.0 | 4 |
可以看出,离婚人群中50岁往上的资产比较多,结婚人群中35到45岁资产比较多,单身的30到35资产比较多。
名称 | 中文名称 | 描述 | 类型 | 是否必须? | 取值范围 | 默认值 |
---|---|---|---|---|---|---|
objectCol | 主体列 | 主体列 | String | ✓ | ||
addedCols | 附加列 | 附加列 | String[] | null | ||
addedStatTypes | 附加列统计类型 | 附加列统计类型 | String[] | null | ||
groupCol | 分组单列名 | 分组单列名,可选 | String | null | ||
groupValues | 计算分组 | 计算分组, 分组列选择时必选, 用逗号分隔 | String[] | null | ||
isDescending | 是否降序 | 是否降序 | Boolean | false | ||
statCol | 计算列 | 计算列 | String | 所选列类型为 [BIGDECIMAL, BIGINTEGER, BYTE, DOUBLE, FLOAT, INTEGER, LONG, SHORT] | null | |
statType | 统计类型 | 统计类型 | String | “count”, “countTotal”, “min”, “max”, “sum”, “mean”, “variance” | “count” | |
topN | 个数 | 个数 | Integer | 10 |
from pyalink.alink import * import pandas as pd useLocalEnv(1) df = pd.DataFrame([ ["1", "a", 1.3, 1.1], ["1", "b", -2.5, 0.9], ["2", "c", 100.2, -0.01], ["2", "d", -99.9, 100.9], ["1", "a", 1.4, 1.1], ["1", "b", -2.2, 0.9], ["2", "c", 100.9, -0.01], ["2", "d", -99.5, 100.9] ]) batchData = BatchOperator.fromDataframe(df, schemaStr='id string, col1 string, col2 double, col3 double') rankList = RankingListBatchOp()\ .setGroupCol("id")\ .setGroupValues(["1", "2"])\ .setObjectCol("col1")\ .setStatCol("col2")\ .setStatType("sum")\ .setTopN(20) batchData.link(rankList).print()
id | col1 | col2 | rank |
---|---|---|---|
1 | b | -4.7 | 1 |
1 | a | 2.7 | 2 |
2 | d | -199.4 | 1 |
2 | c | 201.1 | 2 |