Common PySpark DataFrame Operations, with Examples (Reposted)

I recently started working with PySpark, where the DataFrame API is both important and convenient to use, so this post records my study notes.

For full details, see pyspark.sql module in the official documentation, which describes DataFrame usage thoroughly.

Contents

1. Creating or loading a DataFrame

2. Querying

2.1 Row-level queries

2.2 Column operations

2.3 Sorting

2.4 Sampling

3. Adding and modifying columns

4. Combining DataFrames: join / union

4.1 Row-wise concatenation: union

4.2 Join on a condition

4.3 Union, intersection, and difference

4.4 Splitting a column into multiple rows (explode)

5. Frequency statistics and filtering


1. Creating or loading a DataFrame

The most common way is to create a DataFrame from the result of a SQL query.

 
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *

df = spark.sql("select * from table_name")

You can also use toDF():

 
from pyspark.sql import Row
row = Row("spe_id", "InOther")
x = ['x1', 'x2']
y = ['y1', 'y2']
new_df = sc.parallelize([row(x[i], y[i]) for i in range(2)]).toDF()

Of course, a DataFrame can also be created as shown below; we build the following dataset to illustrate the rest of the operations in this post.

 
test = []
test.append((1, 'age', '30', 50, 40))
test.append((1, 'city', 'beijing', 50, 40))
test.append((1, 'gender', 'fale', 50, 40))
test.append((1, 'height', '172cm', 50, 40))
test.append((1, 'weight', '70kg', 50, 40))
test.append((2, 'age', '26', 100, 80))
test.append((2, 'city', 'beijing', 100, 80))
test.append((2, 'gender', 'fale', 100, 80))
test.append((2, 'height', '170cm', 100, 80))
test.append((2, 'weight', '65kg', 100, 80))
test.append((3, 'age', '35', 99, 99))
test.append((3, 'city', 'nanjing', 99, 99))
test.append((3, 'gender', 'female', 99, 99))
test.append((3, 'height', '161cm', 99, 99))
test.append((3, 'weight', '50kg', 99, 99))
df = spark.createDataFrame(test, ['user_id', 'attr_name', 'attr_value', 'income', 'expenses'])

createDataFrame has a samplingRatio parameter. When a column's type is not specified, Spark samples that fraction of the rows to infer it. We usually set it to 1, so every row is inspected and a column's inferred type will not be NullType as long as it contains at least one non-null value.
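For example, a minimal sketch in which the schema is inferred from an RDD built on the test data above (rdd and df2 are illustrative names; samplingRatio only takes effect when Spark has to infer the column types itself):

rdd = sc.parallelize(test)
# inspect every row (samplingRatio=1.0) when inferring the column types
df2 = spark.createDataFrame(rdd, ['user_id', 'attr_name', 'attr_value', 'income', 'expenses'], samplingRatio=1.0)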

2. Querying

2.1 Row-level queries

Printing data

df.show() prints the first 20 rows by default; you can also specify how many rows to print.

If some values are very long, PySpark truncates them in the output; use df.show(truncate=False) to print them in full.

 
>>> df.show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
|      1|      age|        30|    50|      40|
|      1|     city|   beijing|    50|      40|
|      1|   gender|      fale|    50|      40|
|      1|   height|     172cm|    50|      40|
|      1|   weight|      70kg|    50|      40|
|      2|      age|        26|   100|      80|
|      2|     city|   beijing|   100|      80|
|      2|   gender|      fale|   100|      80|
|      2|   height|     170cm|   100|      80|
|      2|   weight|      65kg|   100|      80|
|      3|      age|        35|    99|      99|
|      3|     city|   nanjing|    99|      99|
|      3|   gender|    female|    99|      99|
|      3|   height|     161cm|    99|      99|
|      3|   weight|      50kg|    99|      99|
+-------+---------+----------+------+--------+

>>> df.show(3)
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
|      1|      age|        30|    50|      40|
|      1|     city|   beijing|    50|      40|
|      1|   gender|      fale|    50|      40|
+-------+---------+----------+------+--------+
only showing top 3 rows
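A quick sketch of the truncate option mentioned above (output omitted; on this small dataset it looks the same as df.show(3), since no value here is long enough to be cut off):

df.show(3, truncate=False)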

Printing the schema

 
>>> df.printSchema()
root
 |-- user_id: long (nullable = true)
 |-- attr_name: string (nullable = true)
 |-- attr_value: string (nullable = true)
 |-- income: long (nullable = true)
 |-- expenses: long (nullable = true)

Counting rows

 
>>> df.count()
15

Fetching the first few rows to the driver

 
>>> list = df.head(3)
>>> df.head(3)
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40)]
>>> df.take(5)
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40), Row(user_id=1, attr_name=u'height', attr_value=u'172cm', income=50, expenses=40), Row(user_id=1, attr_name=u'weight', attr_value=u'70kg', income=50, expenses=40)]

Filtering rows where a column is null

 
>>> from pyspark.sql.functions import isnull
>>> df.filter(isnull("income")).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
+-------+---------+----------+------+--------+

collect() returns a Python list whose elements are Row objects:

 
>>> df.collect()
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40), Row(user_id=1, attr_name=u'height', attr_value=u'172cm', income=50, expenses=40), Row(user_id=1, attr_name=u'weight', attr_value=u'70kg', income=50, expenses=40), Row(user_id=2, attr_name=u'age', attr_value=u'26', income=100, expenses=80), Row(user_id=2, attr_name=u'city', attr_value=u'beijing', income=100, expenses=80), Row(user_id=2, attr_name=u'gender', attr_value=u'fale', income=100, expenses=80), Row(user_id=2, attr_name=u'height', attr_value=u'170cm', income=100, expenses=80), Row(user_id=2, attr_name=u'weight', attr_value=u'65kg', income=100, expenses=80), Row(user_id=3, attr_name=u'age', attr_value=u'35', income=99, expenses=99), Row(user_id=3, attr_name=u'city', attr_value=u'nanjing', income=99, expenses=99), Row(user_id=3, attr_name=u'gender', attr_value=u'female', income=99, expenses=99), Row(user_id=3, attr_name=u'height', attr_value=u'161cm', income=99, expenses=99), Row(user_id=3, attr_name=u'weight', attr_value=u'50kg', income=99, expenses=99)]

Note: this pulls all of the data to the driver and returns it as a list of Row objects. You can index into the list to get a Row, and index into a Row to get individual values.

 
>>> list = df.collect()
>>> list[0]
Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40)
>>> list[0][1]
u'age'

Summary statistics

 
>>> df.describe().show()
+-------+------------------+---------+------------------+-----------------+------------------+
|summary|           user_id|attr_name|        attr_value|           income|          expenses|
+-------+------------------+---------+------------------+-----------------+------------------+
|  count|                15|       15|                15|               15|                15|
|   mean|               2.0|     null|30.333333333333332|             83.0|              73.0|
| stddev|0.8451542547285166|     null| 4.509249752822894|24.15722311383137|25.453037988757707|
|    min|                 1|      age|             161cm|               50|                40|
|    max|                 3|   weight|           nanjing|              100|                99|
+-------+------------------+---------+------------------+-----------------+------------------+

Distinct values (de-duplication)

 
>>> df.select('user_id').distinct().show()
+-------+
|user_id|
+-------+
|      1|
|      3|
|      2|
+-------+

2.2 Column operations

Selecting one or more columns: select

 
df["age"]
df.age
df.select("name")
df.select(df['name'], df['age'] + 1)
df.select(df.a, df.b, df.c)             # select columns a, b and c
df.select(df["a"], df["b"], df["c"])    # select columns a, b and c

Conditional selection with where

 
>>> df.where("income = 50").show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
|      1|      age|        30|    50|      40|
|      1|     city|   beijing|    50|      40|
|      1|   gender|      fale|    50|      40|
|      1|   height|     172cm|    50|      40|
|      1|   weight|      70kg|    50|      40|
+-------+---------+----------+------+--------+

2.3 Sorting

orderBy: sort by the given column(s), ascending by default.

 
>>> df.orderBy(df.income.desc()).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
|      2|   gender|      fale|   100|      80|
|      2|   weight|      65kg|   100|      80|
|      2|   height|     170cm|   100|      80|
|      2|      age|        26|   100|      80|
|      2|     city|   beijing|   100|      80|
|      3|   gender|    female|    99|      99|
|      3|      age|        35|    99|      99|
|      3|   height|     161cm|    99|      99|
|      3|   weight|      50kg|    99|      99|
|      3|     city|   nanjing|    99|      99|
|      1|      age|        30|    50|      40|
|      1|   height|     172cm|    50|      40|
|      1|     city|   beijing|    50|      40|
|      1|   weight|      70kg|    50|      40|
|      1|   gender|      fale|    50|      40|
+-------+---------+----------+------+--------+

2.4 Sampling

sample draws a random sample. withReplacement = True or False controls whether sampling is done with replacement, the second argument is the fraction of rows to sample, and the last argument (42 here) is the seed.

t1 = train.sample(False, 0.2, 42)
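Applied to the df built above, a minimal sketch (t2 is an illustrative name; the number of rows returned is only approximately fraction * count and can differ between runs):

# keep roughly 30% of the rows, without replacement, with a fixed seed
t2 = df.sample(withReplacement=False, fraction=0.3, seed=42)
t2.show()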

3. Adding and modifying columns

Adding a new column

withColumn returns a new DataFrame by adding a column, or replacing an existing column of the same name.

However, the following raises an error:

 
>>> df.withColumn('label', 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/servers/10k/mart_vdp/spark/python/pyspark/sql/dataframe.py", line 1848, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

The error AssertionError: col should be Column means the second argument must be a Column expression rather than a plain literal. There are two ways to achieve this.

One way is to use lit from pyspark.sql.functions:

 
>>> from pyspark.sql.functions import *
>>> df.withColumn('label', lit(0))
DataFrame[user_id: bigint, attr_name: string, attr_value: string, income: bigint, expenses: bigint, label: int]

The other is to derive the new column from an existing column:

 
>>> df.withColumn('income1', df.income + 10).show(5)
+-------+---------+----------+------+--------+-------+
|user_id|attr_name|attr_value|income|expenses|income1|
+-------+---------+----------+------+--------+-------+
|      1|      age|        30|    50|      40|     60|
|      1|     city|   beijing|    50|      40|     60|
|      1|   gender|      fale|    50|      40|     60|
|      1|   height|     172cm|    50|      40|     60|
|      1|   weight|      70kg|    50|      40|     60|
+-------+---------+----------+------+--------+-------+
only showing top 5 rows

Renaming a column

 
>>> df.withColumnRenamed("income", "income2").show(3)
+-------+---------+----------+-------+--------+
|user_id|attr_name|attr_value|income2|expenses|
+-------+---------+----------+-------+--------+
|      1|      age|        30|     50|      40|
|      1|     city|   beijing|     50|      40|
|      1|   gender|      fale|     50|      40|
+-------+---------+----------+-------+--------+
only showing top 3 rows

4. Combining DataFrames: join / union

4.1 Row-wise concatenation: union

union concatenates two DataFrames row-wise (they must have the same schema):

 
>>> df.union(df).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
|      1|      age|        30|    50|      40|
|      1|     city|   beijing|    50|      40|
|      1|   gender|      fale|    50|      40|
|      1|   height|     172cm|    50|      40|
|      1|   weight|      70kg|    50|      40|
|      2|      age|        26|   100|      80|
|      2|     city|   beijing|   100|      80|
|      2|   gender|      fale|   100|      80|
|      2|   height|     170cm|   100|      80|
|      2|   weight|      65kg|   100|      80|
|      3|      age|        35|    99|      99|
|      3|     city|   nanjing|    99|      99|
|      3|   gender|    female|    99|      99|
|      3|   height|     161cm|    99|      99|
|      3|   weight|      50kg|    99|      99|
|      1|      age|        30|    50|      40|
|      1|     city|   beijing|    50|      40|
|      1|   gender|      fale|    50|      40|
|      1|   height|     172cm|    50|      40|
|      1|   weight|      70kg|    50|      40|
+-------+---------+----------+------+--------+
only showing top 20 rows

4.2 Join on a condition

Join on a single key

Joining two DataFrames:

df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
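As a concrete sketch using the df from section 1 and a small made-up per-user table (df_level and its values are purely illustrative):

# hypothetical lookup table keyed by user_id
df_level = spark.createDataFrame([(1, 'silver'), (2, 'gold'), (3, 'gold')], ['user_id', 'level'])
# passing the key name via `on` keeps a single user_id column in the result
df.join(df_level, on='user_id', how='inner').show(3)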

4.3 Union, intersection, and difference

Here is an example; first build two DataFrames:

 
sentenceDataFrame = spark.createDataFrame((
    (1, "asf"),
    (2, "2143"),
    (3, "rfds")
)).toDF("label", "sentence")
sentenceDataFrame.show()

sentenceDataFrame1 = spark.createDataFrame((
    (1, "asf"),
    (2, "2143"),
    (4, "f8934y")
)).toDF("label", "sentence")

# difference
newDF = sentenceDataFrame1.select("sentence").subtract(sentenceDataFrame.select("sentence"))
newDF.show()

+--------+
|sentence|
+--------+
|  f8934y|
+--------+

# intersection
newDF = sentenceDataFrame1.select("sentence").intersect(sentenceDataFrame.select("sentence"))
newDF.show()

+--------+
|sentence|
+--------+
|     asf|
|    2143|
+--------+

# union
newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence"))
newDF.show()

+--------+
|sentence|
+--------+
|     asf|
|    2143|
|  f8934y|
|     asf|
|    2143|
|    rfds|
+--------+

4.4 Splitting a column into multiple rows (explode)

Sometimes you need to split the content of a column and generate one row per piece; the explode method does this. The idea is to split the c3 column on spaces and store each resulting piece in a new column c3_, as sketched below.
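A minimal sketch along those lines (df_c and its rows are made up for illustration; the c3 and c3_ names follow the description above):

from pyspark.sql.functions import split, explode

df_c = spark.createDataFrame([(1, 'a b c'), (2, 'd e')], ['c1', 'c3'])
# split c3 on spaces, then explode the resulting array into one row per element
df_c.withColumn('c3_', explode(split(df_c.c3, ' '))).show()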

 

Note: Spark evaluates lazily. A transformation such as explode only describes the computation; nothing is actually executed until an action (for example show() or count()) is triggered.

5. Frequency statistics and filtering

These live in the stat module (DataFrame.stat); reference [2] covers them in detail.
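As a minimal sketch of two stat-module calls on the df from above (the support threshold of 0.3 is an arbitrary illustrative value):

# values of income that occur in at least ~30% of rows (approximate algorithm)
df.stat.freqItems(['income'], support=0.3).show()

# contingency table of user_id against attr_name
df.stat.crosstab('user_id', 'attr_name').show()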

Grouped aggregation: group by

 
train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
|  55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+

Applying multiple aggregate functions

df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()

apply-style functions

udf (user-defined function) application; a minimal sketch follows below.

And so on.
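As a minimal udf sketch on the df from above (attr_name_upper and the lambda are purely illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# wrap a plain Python function as a column-level udf that returns a string
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn('attr_name_upper', to_upper(df.attr_name)).show(3)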

Of course, PySpark DataFrames offer far more functionality than this; I will not go through everything one by one here. See reference [2] for more.

References:

[1] PySpark︱DataFrame操作指南:增/删/改/查/合并/统计与数据处理

[2] pyspark.sql module (official PySpark documentation)
