Quickly setting up a Spark development environment on Windows.

Setting up a local Spark practice environment

Practice environment setup

To keep environment setup from blocking your study of Spark itself (that part matters too, of course, and is worth learning once you have a basic grasp of Spark), you can use local mode to get a practice environment running quickly. For the concrete steps, see: Spark在Windows下的环境搭建 (setting up Spark on Windows). I have verified it myself and hope it helps those getting started with big data.

Setting up the IDEA development environment

With the environment above in place, you can do simple exercises in spark-shell, but that is not enough for anyone who is uncomfortable programming on the command line, nor for more involved exercises. The steps for setting up a development environment in IDEA are recorded below.

scala> val textFile = sc.textFile("file:///root/install.log.syslog")
textFile: org.apache.spark.rdd.RDD[String] = file:///root/install.log.syslog MapPartitionsRDD[1] at textFile at <console>:24

scala> textFile.count
res0: Long = 159

scala> textFile.flatMap(line => line.split(" ")).map(word=>(word,1)).reduceByKey(_+_)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:27

scala> val wordsCount = textFile.flatMap(line => line.split(" ")).map(word=>(word,1)).reduceByKey(_+_)
wordsCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:26
  • Prerequisite: the previous step has been completed correctly.
  1. Create a new Java console project and convert it to a Maven project.
  2. The pom file is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.lumeris</groupId>
  <artifactId>wordcount</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <packaging>jar</packaging>

  <name>wordcount</name>
  <url>http://maven.apache.org</url>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>2.0.0</version>
    </dependency>
  </dependencies>

</project>
  3. Create a new Sample.java file.
package com.lumeris.wordcount;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class Sample implements Serializable {
	public static void main(String[] args){
		String filepath = "D:\\study\\wordcount\\wordcount\\src\\main\\resources\\test.log";

		SparkConf conf = new SparkConf().setMaster("local").setAppName("Simple Application");
		JavaSparkContext sc = new JavaSparkContext(conf);
		JavaRDD<String> logData = sc.textFile(filepath);

		/*
		filterLines(logData);

		JavaRDD<String> logDateRdd = logData.map(s -> s.split("]")[0]);
		List<String> dates = logDateRdd.collect();
		dates.forEach(System.out::println);
		*/

		// Count the lines that contain the word "error".
		long errorCount = logData.filter(p -> p.contains("error")).count();
		System.out.println(errorCount);

		// Find the largest number of words on a single line.
		long mostWordsCount = logData.map(p -> p.split(" ").length).reduce(Math::max);
		System.out.println(mostWordsCount);

		// Classic word count: split each line into words, pair each word with 1,
		// sum the counts per word, and sort by the word (the key).
		JavaPairRDD<String, Integer> wordCount = logData
				.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
				.mapToPair(p -> new Tuple2<>(p, 1))
				.reduceByKey((a, b) -> a + b)
				.sortByKey();

		//wordCount.foreach(p -> System.out.println(p));

		// Cache the RDD in cluster memory so that later accesses to wordCount
		// do not trigger recomputation.
		//wordCount.cache();

	}

	public static void filterLines(final JavaRDD<String> logData){
		JavaRDD<String> logDataWitha = logData.filter(new Function<String, Boolean>(){
			public Boolean call(String s){
				return s.contains("a");
			}
		});
		long numAs = logDataWitha.count();

		JavaRDD<String> logDataWithb = logData.filter(new Function<String, Boolean>(){
			public Boolean call(String s){
				return s.contains("b");
			}
		});
		long numBs = logDataWithb.count();

		System.out.println("Lines with a are : " + numAs + ", lines with b are : " + numBs);
	}

}
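Note that wordCount above is only a chain of transformations; Spark evaluates RDDs lazily, so nothing is computed until an action runs on it. Below is a minimal sketch of materializing and printing the result, written as a fragment you could append at the end of main (it assumes the wordCount pair RDD built above and the imports already present in Sample.java):

		// collect() is an action: it triggers the computation and brings the whole
		// result back to the driver, so only use it when the result is small.
		List<Tuple2<String, Integer>> counts = wordCount.collect();
		for (Tuple2<String, Integer> entry : counts) {
			System.out.println(entry._1() + " : " + entry._2());
		}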

For testing you can now simply run the main method; it contains examples of the basic Spark operations. In practice, our code has to be submitted to a cluster to run; that part is covered in the official documentation. In addition, our code is usually submitted from a web container such as Tomcat, so Spark provides a class called SparkLauncher that lets us submit our application to Spark. The code is as follows:

package com.lumeris.wordcount;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SparkLuncher {
    public static void  main(String[] args) throws Exception
    {
        SparkLauncher launcher = new SparkLauncher();
        launcher.setAppName("APPName");
//        launcher.setMaster("yarn-cluster");
        launcher.setMaster("local");
        // The application jar is referenced by its absolute path on the file system.
        launcher.setAppResource("D:\\study\\wordcount\\wordcount\\target\\wordcount-0.0.1-SNAPSHOT.jar");
        launcher.setMainClass("com.lumeris.wordcount.Sample");
//        launcher.setConf(SparkLauncher.DRIVER_MEMORY, "2g");
//        launcher.setConf(SparkLauncher.EXECUTOR_MEMORY, "2g");
//        launcher.setConf(SparkLauncher.EXECUTOR_CORES, "3");
        launcher.addAppArgs(new String[]{"", ""}); // arguments passed to the Spark app's main method
        launcher.setVerbose(true);
        launcher.setSparkHome("D:\\lib\\spark-2.2.0-bin-hadoop2.7");

        SparkAppHandle handle = launcher.startApplication();

        // Poll once a second until the application reports the FINISHED state.
        while(handle.getState() != SparkAppHandle.State.FINISHED) {
            Thread.sleep(1000L);
            System.out.println("applicationId is: "+ handle.getAppId());
            System.out.println("current state: "+ handle.getState());
//            handle.stop();
        }
    }
}
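The polling loop above works, but it will keep looping if the application ends in the FAILED or KILLED state rather than FINISHED. As an alternative, SparkLauncher.startApplication also accepts SparkAppHandle.Listener callbacks that are invoked whenever the application's state changes. Below is a minimal sketch under the same setup; the class name SparkLauncherWithListener is just for illustration, and the paths are the same assumptions as in the code above:

package com.lumeris.wordcount;

import java.util.concurrent.CountDownLatch;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SparkLauncherWithListener {
    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(1);

        SparkLauncher launcher = new SparkLauncher()
                .setAppName("APPName")
                .setMaster("local")
                .setAppResource("D:\\study\\wordcount\\wordcount\\target\\wordcount-0.0.1-SNAPSHOT.jar")
                .setMainClass("com.lumeris.wordcount.Sample")
                .setSparkHome("D:\\lib\\spark-2.2.0-bin-hadoop2.7");

        // The listener is called back whenever the application's state changes,
        // so there is no need to poll in a loop.
        launcher.startApplication(new SparkAppHandle.Listener() {
            @Override
            public void stateChanged(SparkAppHandle handle) {
                System.out.println("state: " + handle.getState() + ", appId: " + handle.getAppId());
                // isFinal() covers FINISHED as well as FAILED and KILLED.
                if (handle.getState().isFinal()) {
                    done.countDown();
                }
            }

            @Override
            public void infoChanged(SparkAppHandle handle) {
                // Called when application info such as the application id becomes available.
            }
        });

        // Block until the application reaches a terminal state.
        done.await();
    }
}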

That's it. From here you can dig deeper into the RDD APIs. And as a reminder, don't forget to learn how to set up a Spark cluster.

Reference: CSDN blog post
