Installing PySpark on macOS and calling it from Python in PyCharm. Since I mostly work in Python these days, I wanted to install PySpark and call it from PyCharm, i.e. use pyspark to drive Spark locally and run Spark programs straight from Python.
This post covers the whole process: downloading and installing the software, a first round of configuration, writing the program, the first (failed) run, a second round of configuration, and finally a successful run. Without further ado, here is the process.
localhost:python a6$ pwd
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python
localhost:python a6$ cd pyspark/
localhost:pyspark a6$ ls
__init__.py broadcast.pyc context.py find_spark_home.py java_gateway.pyc profiler.py rddsampler.pyc shell.py statcounter.pyc streaming version.pyc
__init__.pyc cloudpickle.py context.pyc find_spark_home.pyc join.py profiler.pyc resultiterable.py shuffle.py status.py tests.py worker.py
accumulators.py cloudpickle.pyc daemon.py heapq3.py join.pyc rdd.py resultiterable.pyc shuffle.pyc status.pyc traceback_utils.py
accumulators.pyc conf.py files.py heapq3.pyc ml rdd.pyc serializers.py sql storagelevel.py traceback_utils.pyc
broadcast.py conf.pyc files.pyc java_gateway.py mllib rddsampler.py serializers.pyc statcounter.py storagelevel.pyc version.py
localhost:python a6$ python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
>>> exit()
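Rather than eyeballing the long sys.path listing to find the site-packages directory, it can also be read programmatically. A small sketch using only the standard library (sysconfig is available in both Python 2.7 and 3):

```python
import sysconfig

# "purelib" is the site-packages directory for pure-Python packages
site_packages = sysconfig.get_paths()["purelib"]
print(site_packages)
```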
localhost:site-packages a6$ pwd
/Library/Python/2.7/site-packages
localhost:site-packages a6$ mkdir pyspark
mkdir: pyspark: Permission denied
localhost:site-packages a6$ sudo mkdir pyspark
Password:
localhost:pyspark a6$ pwd
/Library/Python/2.7/site-packages/pyspark
localhost:pyspark a6$ cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: ./__init__.py: Permission denied
cp: ./__init__.pyc: Permission denied
cp: ./accumulators.py: Permission denied
cp: ./accumulators.pyc: Permission denied
cp: ./broadcast.py: Permission denied
cp: ./broadcast.pyc: Permission denied
…………
cp: ./join.pyc: Permission denied
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: ./profiler.py: Permission denied
cp: ./profiler.pyc: Permission denied
cp: ./rdd.py: Permission denied
cp: ./rdd.pyc: Permission denied
cp: ./rddsampler.py: Permission denied
cp: ./rddsampler.pyc: Permission denied
cp: ./resultiterable.py: Permission denied
cp: ./resultiterable.pyc: Permission denied
cp: ./serializers.py: Permission denied
cp: ./serializers.pyc: Permission denied
localhost:pyspark a6$ sudo cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/streaming is a directory (not copied).
localhost:pyspark a6$ sudo cp -rf /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
localhost:pyspark a6$ ls
__init__.py broadcast.pyc context.py find_spark_home.py java_gateway.pyc profiler.py rddsampler.pyc shell.py statcounter.pyc streaming version.pyc
__init__.pyc cloudpickle.py context.pyc find_spark_home.pyc join.py profiler.pyc resultiterable.py shuffle.py status.py tests.py worker.py
accumulators.py cloudpickle.pyc daemon.py heapq3.py join.pyc rdd.py resultiterable.pyc shuffle.pyc status.pyc traceback_utils.py
accumulators.pyc conf.py files.py heapq3.pyc ml rdd.pyc serializers.py sql storagelevel.py traceback_utils.pyc
broadcast.py conf.pyc files.pyc java_gateway.py mllib rddsampler.py serializers.pyc statcounter.py storagelevel.pyc version.py
localhost:pyspark a6$
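Copying the pyspark package into site-packages works, but as an alternative sketch (assuming the same Spark install path as above; adjust for your machine) you can leave the Spark tree where it is and add its python directory to sys.path at runtime instead:

```python
import os
import sys

spark_home = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"  # path from this post
spark_python = os.path.join(spark_home, "python")

# Make pyspark importable without copying anything into site-packages
if spark_python not in sys.path:
    sys.path.insert(0, spark_python)

print(sys.path[0])
```

The upside of this approach is that upgrading Spark later only means changing one path, instead of re-copying files with sudo.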
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile('words.txt')
    # split each line into words, pair each word with 1, then sum per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)
    sc.stop()
Contents of words.txt:
good bad cool
hadoop spark mlib
good spark mlib
cool spark bad
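To make clear what the flatMap / map / reduceByKey pipeline computes, here is the same word count in plain Python over the four lines above (no Spark needed; collections.Counter plays the role of reduceByKey):

```python
from collections import Counter

lines = [
    "good bad cool",
    "hadoop spark mlib",
    "good spark mlib",
    "cool spark bad",
]

# flatMap: split every line into words; Counter: sum 1 per occurrence
counts = Counter(word for line in lines for word in line.split(" "))

for word, count in counts.items():
    print("%s: %i" % (word, count))
```

The counts it produces (spark: 3, hadoop: 1, and 2 for the rest) match the Spark output shown below.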
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Could not find valid SPARK_HOME while searching ['/Users/a6/Downloads/PycharmProjects', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7']
Process finished with exit code 255
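The first run fails because pyspark cannot locate a Spark installation. One way to fix it (a sketch; the Spark path is the one used earlier in this post and may differ on your machine) is to set SPARK_HOME before pyspark is imported, either in the PyCharm run configuration's environment variables or at the top of the script:

```python
import os

# Point SPARK_HOME at the unpacked Spark distribution
# (path from this post; adjust to your own install location).
os.environ["SPARK_HOME"] = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"

# Note: pyspark must be imported *after* SPARK_HOME is set,
# or it will search for a Spark home and fail as above.
print(os.environ["SPARK_HOME"])
```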
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/13 16:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/13 16:30:48 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 10.2.32.96 instead (on interface en0)
17/10/13 16:30:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
bad: 2
spark: 3
mlib: 2
good: 2
hadoop: 1
cool: 2
Process finished with exit code 0