jupyter notebook + pyspark environment setup

Installing and starting Jupyter

After installing Anaconda, install jupyter:
pip install jupyter

Setting up the environment
ipython --ipython-dir= # override the default IPYTHONDIR directory, ~/.ipython/ by default
ipython profile create foo # create the profile foo
ipython profile locate foo # find the foo profile directory, under IPYTHONDIR by default
ipython --profile=foo # start IPython using the new profile

Commands to start jupyter. After startup, a browser is opened into the notebook environment by default.
ipython notebook # start the jupyter notebook server on the default port 8888
ipython notebook --ip=0.0.0.0 --port=80 # start the server on a specified IP and port
ipython notebook --profile=foo # start the server with the foo profile
ipython notebook --pylab inline # start with PyLab graphing support enabled

For more on using jupyter, see:
http://nbviewer.jupyter.org/github/ipython/ipython/blob/3.x/examples/Notebook/Notebook%20Basics.ipynb

Customizing Jupyter

Configuring the Notebook App's basic settings

The file is ~/.ipython/profile_foo/ipython_notebook_config.py:

c = get_config()
c.IPKernelApp.pylab = 'inline'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880 # or whatever you want

Adding line numbers to notebook cells

Add the following lines to ~/.ipython/profile_foo/static/custom/custom.js:

define([
    'base/js/namespace',
    'base/js/events'
    ], 
    function(IPython, events) {
        events.on("app_initialized.NotebookApp", 
            function () {
                require("notebook/js/cell").Cell.options_default.cm_config.lineNumbers = true;
            }
        );
    }
);

Changing the Jupyter theme

https://github.com/transcranial/jupyter-themer
Command to change the theme:
jupyter-themer -c monokai

Integrating with PySpark

Compared with the plain Python interpreter, IPython's strength is its better support for interactive work, so integrating IPython into PySpark only pays off where interactivity matters; in practice, that means only the pyspark shell.
There are several ways to integrate Jupyter with the pyspark shell, for example:

  1. Start IPython first, then run pyspark/shell.py to start Spark.
    After starting IPython we could invoke pyspark/shell.py by hand; to have it run automatically, place the call in the IPython profile's startup directory, i.e. in ~/.ipython/profile_foo/startup/00-pyspark-setup.py.
    For a sample 00-pyspark-setup.py, see https://github.com/harisekhon/pytools/blob/master/.ipython-notebook-pyspark.00-pyspark-setup.py
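A startup script along those lines could be sketched roughly as follows. This is a minimal sketch, not the referenced file: the reliance on SPARK_HOME and the py4j zip lookup are assumptions, and the py4j version bundled under $SPARK_HOME/python/lib varies by Spark release.

```python
# Sketch of ~/.ipython/profile_foo/startup/00-pyspark-setup.py
# (assumes SPARK_HOME points at a local Spark installation).
import glob
import os
import sys


def pyspark_sys_paths(spark_home):
    """Return the sys.path entries needed to import pyspark from spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # py4j ships inside Spark; pick up whichever version is bundled.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    return [python_dir] + py4j_zips


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    sys.path[:0] = pyspark_sys_paths(spark_home)
    shell_py = os.path.join(spark_home, "python", "pyspark", "shell.py")
    if os.path.exists(shell_py):
        # shell.py creates the SparkContext `sc`, just as the pyspark shell does.
        exec(open(shell_py).read())
```

Because the script lives in the profile's startup directory, it runs on every IPython launch under that profile; the SPARK_HOME guard keeps it harmless in sessions where Spark is not installed.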

  2. Use IPython as the (more capable) interpreter that drives pyspark.
    The following example starts the pyspark shell on the Spark master server.
    spark_master_node$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777 --profile=foo" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://spark_master_hostname:7077 --executor-memory 6400M --driver-memory 6400M

With the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables set, subsequent pyspark invocations will run Python Spark applications with the interpreter and options they specify.
Note that these two variables should not be exported: once exported, pyspark applications other than the shell would also run under IPython, which invites misuse.
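The point about not exporting can be illustrated from Python: passing an environment explicitly to a child process (which is what a per-command `VAR=value cmd` assignment does in the shell) leaves the parent's environment untouched, whereas export would affect every later process too.

```python
# Demonstrates per-command assignment vs. export: the child process sees
# PYSPARK_DRIVER_PYTHON, while the parent environment is left untouched.
import os
import subprocess
import sys

child_env = dict(os.environ, PYSPARK_DRIVER_PYTHON="ipython")
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['PYSPARK_DRIVER_PYTHON'])"],
    env=child_env, capture_output=True, text=True,
)
print(result.stdout.strip())  # the child saw the variable: ipython
# Parent unchanged (False unless the variable was already set in your shell):
print("PYSPARK_DRIVER_PYTHON" in os.environ)
```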

To simplify submitting pyspark applications, a PYSPARK_SUBMIT_ARGS environment variable can be set in advance, e.g.:
export PYSPARK_SUBMIT_ARGS='--master local[2]'
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --num-executors 24 --executor-memory 10g --executor-cores 5'

References

How-to: Use IPython Notebook with Apache Spark
http://www.tuicool.com/articles/rqIv6z
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with
How to Install PySpark and Integrate with IPython Notebook
https://www.dataquest.io/blog/installing-pyspark/
http://www.tuicool.com/articles/VFn6j2Y
Configuring IPython Notebook Support for PySpark
http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
Using Jupyter on Apache Spark: Step-by-Step with a Terabyte of Reddit Data
http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/
How to customize the jupyter notebook theme
http://www.cnblogs.com/wybert/p/5030697.html
Adding line numbers to jupyter cells by default
https://stackoverflow.com/questions/20197471/how-to-display-line-numbers-in-ipython-notebook-code-cell-by-default/20197878
Setting up a Spark programming environment (IPython)
http://www.kinelf.com/?p=169
Using Docker to quickly set up a data science development environment (Docker + Jupyter)
https://linux.cn/article-6644-1.html
