I've recently started working with PySpark, where the DataFrame API is both important and convenient, so I'm recording my study notes here.
For the full API, see the pyspark.sql module in the official documentation, which describes DataFrame usage in detail.
Contents
1. Creating or reading a DataFrame
2. Querying
2.1 Row lookups
2.2 Column operations
2.3 Sorting
2.4 Sampling
3. Adding and modifying columns
4. Combining: join / union
4.1 Row-wise concatenation with union
4.2 Joining on a condition
4.3 Union, intersection, difference
4.4 Splitting a field into multiple rows
5. Frequency statistics and filtering
The most common way to create a DataFrame is from the result of a SQL query:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.sql("select * from table_name")
You can also use toDF():
from pyspark.sql import Row
row = Row("spe_id", "InOther")
x = ['x1','x2']
y = ['y1','y2']
new_df = sc.parallelize([row(x[i], y[i]) for i in range(2)]).toDF()
Of course, you can also create a DataFrame directly, as below; we use this dataset to illustrate the operations that follow.
test = []
test.append((1, 'age', '30', 50, 40))
test.append((1, 'city', 'beijing', 50, 40))
test.append((1, 'gender', 'fale', 50, 40))
test.append((1, 'height', '172cm', 50, 40))
test.append((1, 'weight', '70kg', 50, 40))
test.append((2, 'age', '26', 100, 80))
test.append((2, 'city', 'beijing', 100, 80))
test.append((2, 'gender', 'fale', 100, 80))
test.append((2, 'height', '170cm', 100, 80))
test.append((2, 'weight', '65kg', 100, 80))
test.append((3, 'age', '35', 99, 99))
test.append((3, 'city', 'nanjing', 99, 99))
test.append((3, 'gender', 'female', 99, 99))
test.append((3, 'height', '161cm', 99, 99))
test.append((3, 'weight', '50kg', 99, 99))
df = spark.createDataFrame(test,['user_id', 'attr_name','attr_value', 'income', 'expenses'])
createDataFrame takes a samplingRatio parameter: when a column's type is ambiguous, Spark samples that fraction of the rows to infer it. It is common to set it to 1.0, so that as long as the column has at least one non-null value, its type will not be inferred as null.
Printing data
df.show() prints the first 20 rows by default; you can also pass the number of rows to print.
If some values are very long, PySpark truncates them so rows don't print in full; use df.show(truncate=False) in that case.
>>> df.show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
| 2| age| 26| 100| 80|
| 2| city| beijing| 100| 80|
| 2| gender| fale| 100| 80|
| 2| height| 170cm| 100| 80|
| 2| weight| 65kg| 100| 80|
| 3| age| 35| 99| 99|
| 3| city| nanjing| 99| 99|
| 3| gender| female| 99| 99|
| 3| height| 161cm| 99| 99|
| 3| weight| 50kg| 99| 99|
+-------+---------+----------+------+--------+
>>> df.show(3)
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
+-------+---------+----------+------+--------+
only showing top 3 rows
Printing the schema
>>> df.printSchema()
root
|-- user_id: long (nullable = true)
|-- attr_name: string (nullable = true)
|-- attr_value: string (nullable = true)
|-- income: long (nullable = true)
|-- expenses: long (nullable = true)
Counting rows
>>> df.count()
15
Fetching the first few rows to the driver
>>> df.head(3)
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40)]
>>> df.take(5)
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40), Row(user_id=1, attr_name=u'height', attr_value=u'172cm', income=50, expenses=40), Row(user_id=1, attr_name=u'weight', attr_value=u'70kg', income=50, expenses=40)]
Finding rows where a column is null
>>> from pyspark.sql.functions import isnull
>>> df.filter(isnull("income")).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
+-------+---------+----------+------+--------+
Returning a list whose elements are Row objects:
>>> df.collect()
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40), Row(user_id=1, attr_name=u'height', attr_value=u'172cm', income=50, expenses=40), Row(user_id=1, attr_name=u'weight', attr_value=u'70kg', income=50, expenses=40), Row(user_id=2, attr_name=u'age', attr_value=u'26', income=100, expenses=80), Row(user_id=2, attr_name=u'city', attr_value=u'beijing', income=100, expenses=80), Row(user_id=2, attr_name=u'gender', attr_value=u'fale', income=100, expenses=80), Row(user_id=2, attr_name=u'height', attr_value=u'170cm', income=100, expenses=80), Row(user_id=2, attr_name=u'weight', attr_value=u'65kg', income=100, expenses=80), Row(user_id=3, attr_name=u'age', attr_value=u'35', income=99, expenses=99), Row(user_id=3, attr_name=u'city', attr_value=u'nanjing', income=99, expenses=99), Row(user_id=3, attr_name=u'gender', attr_value=u'female', income=99, expenses=99), Row(user_id=3, attr_name=u'height', attr_value=u'161cm', income=99, expenses=99), Row(user_id=3, attr_name=u'weight', attr_value=u'50kg', income=99, expenses=99)]
Note: this method pulls all the data to the driver and returns it as a list. You can index into the list to get a Row, and index into the Row to get individual values.
>>> rows = df.collect()
>>> rows[0]
Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40)
>>> rows[0][1]
u'age'
Summary statistics
>>> df.describe().show()
+-------+------------------+---------+------------------+-----------------+------------------+
|summary| user_id|attr_name| attr_value| income| expenses|
+-------+------------------+---------+------------------+-----------------+------------------+
| count| 15| 15| 15| 15| 15|
| mean| 2.0| null|30.333333333333332| 83.0| 73.0|
| stddev|0.8451542547285166| null| 4.509249752822894|24.15722311383137|25.453037988757707|
| min| 1| age| 161cm| 50| 40|
| max| 3| weight| nanjing| 100| 99|
+-------+------------------+---------+------------------+-----------------+------------------+
Deduplication (set-style distinct)
>>> df.select('user_id').distinct().show()
+-------+
|user_id|
+-------+
| 1|
| 3|
| 2|
+-------+
Selecting one or more columns: select
df["age"]
df.age
df.select("name")
df.select(df['name'], df['age'] + 1)
df.select(df.a, df.b, df.c)  # select columns a, b, c
df.select(df["a"], df["b"], df["c"])  # select columns a, b, c
Conditional selection with where
>>> df.where("income = 50" ).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
+-------+---------+----------+------+--------+
orderBy: sort by the given column(s), ascending by default
>>> df.orderBy(df.income.desc()).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 2| gender| fale| 100| 80|
| 2| weight| 65kg| 100| 80|
| 2| height| 170cm| 100| 80|
| 2| age| 26| 100| 80|
| 2| city| beijing| 100| 80|
| 3| gender| female| 99| 99|
| 3| age| 35| 99| 99|
| 3| height| 161cm| 99| 99|
| 3| weight| 50kg| 99| 99|
| 3| city| nanjing| 99| 99|
| 1| age| 30| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| city| beijing| 50| 40|
| 1| weight| 70kg| 50| 40|
| 1| gender| fale| 50| 40|
+-------+---------+----------+------+--------+
sample draws a random sample: withReplacement = True / False controls whether sampling is with replacement, the second argument is the fraction to keep, and 42 here is the random seed (train is just some existing DataFrame):
t1 = train.sample(False, 0.2, 42)
Adding a column
withColumn returns a new DataFrame with a column added, or with an existing column of the same name replaced.
However, writing it like this raises an error:
>>> df.withColumn('label', 0)
Traceback (most recent call last):
File "
File "/software/servers/10k/mart_vdp/spark/python/pyspark/sql/dataframe.py", line 1848, in withColumn
assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column
The error AssertionError: col should be Column means the second argument must be a Column expression, not a plain literal. There are two ways to get one.
One is via pyspark.sql.functions, using lit to wrap the literal:
>>> from pyspark.sql.functions import *
>>> df.withColumn('label', lit(0))
DataFrame[user_id: bigint, attr_name: string, attr_value: string, income: bigint, expenses: bigint, label: int]
The other is to derive the new column from an existing one:
>>> df.withColumn('income1', df.income+10).show(5)
+-------+---------+----------+------+--------+-------+
|user_id|attr_name|attr_value|income|expenses|income1|
+-------+---------+----------+------+--------+-------+
| 1| age| 30| 50| 40| 60|
| 1| city| beijing| 50| 40| 60|
| 1| gender| fale| 50| 40| 60|
| 1| height| 172cm| 50| 40| 60|
| 1| weight| 70kg| 50| 40| 60|
+-------+---------+----------+------+--------+-------+
only showing top 5 rows
Renaming a column
>>> df.withColumnRenamed( "income" , "income2" ).show(3)
+-------+---------+----------+-------+--------+
|user_id|attr_name|attr_value|income2|expenses|
+-------+---------+----------+-------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
+-------+---------+----------+-------+--------+
only showing top 3 rows
union concatenates two DataFrames row-wise:
>>> df.union(df).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
| 2| age| 26| 100| 80|
| 2| city| beijing| 100| 80|
| 2| gender| fale| 100| 80|
| 2| height| 170cm| 100| 80|
| 2| weight| 65kg| 100| 80|
| 3| age| 35| 99| 99|
| 3| city| nanjing| 99| 99|
| 3| gender| female| 99| 99|
| 3| height| 161cm| 99| 99|
| 3| weight| 50kg| 99| 99|
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
+-------+---------+----------+------+--------+
only showing top 20 rows
Single-key join
To join two tables:
df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
Here is an example; first construct two DataFrames:
sentenceDataFrame = spark.createDataFrame((
(1, "asf"),
(2, "2143"),
(3, "rfds")
)).toDF("label", "sentence")
sentenceDataFrame.show()
sentenceDataFrame1 = spark.createDataFrame((
(1, "asf"),
(2, "2143"),
(4, "f8934y")
)).toDF("label", "sentence")
# difference (subtract)
newDF = sentenceDataFrame1.select("sentence").subtract(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
| f8934y|
+--------+
# intersection
newDF = sentenceDataFrame1.select("sentence").intersect(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
| asf|
| 2143|
+--------+
# union (note: duplicates are kept)
newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
| asf|
| 2143|
| f8934y|
| asf|
| 2143|
| rfds|
+--------+
Sometimes you need to split a field's content on a delimiter and generate one row per piece; the explode method does this: for example, splitting a c3 field on spaces and storing each piece in a new field c3_.
Note: Spark is lazily evaluated, so transformations only run when an action such as df.count() is triggered. The frequency-statistics and filtering helpers live in df.stat (DataFrameStatFunctions); reference [2] describes them in detail.
Grouped aggregation with group by (train here is a separate DataFrame with Age and Purchase columns):
train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
| Age| avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
| 55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+
Applying multiple aggregate functions:
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
Applying functions with apply
Applying UDFs
And so on. PySpark's DataFrame API is very rich; I won't enumerate everything here. See reference [2] for details.
References:
[1] PySpark︱DataFrame操作指南:增/删/改/查/合并/统计与数据处理 (a PySpark DataFrame operation guide, in Chinese)
[2] pyspark.sql module (official documentation)