Big Data Operations: Mahout

Mahout exercises:

  1. Install the Mahout client on the master node, then open a Linux shell and run the mahout command to view Mahout's bundled example programs.
    [root@master ~]# mahout
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf
    MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar
    An example program must be given as the first argument.
    Valid program names are:
    arff.vector: : Generate Vectors from an ARFF file or directory
    baumwelch: : Baum-Welch algorithm for unsupervised HMM training
    buildforest: : Build the random forest classifier
    canopy: : Canopy clustering
    cat: : Print a file or resource as the logistic regression models would see it
    cleansvd: : Cleanup and verification of SVD output
    clusterdump: : Dump cluster output to text
    clusterpp: : Groups Clustering Output In Clusters
    cmdump: : Dump confusion matrix in HTML or text formats
    concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
    cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
    cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
    describe: : Describe the fields and target variable in a data set
    evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
    fkmeans: : Fuzzy K-means clustering
    hmmpredict: : Generate random sequence of observations by given HMM
    itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
    kmeans: : K-means clustering
    lucene.vector: : Generate Vectors from a Lucene index
    lucene2seq: : Generate Text SequenceFiles from a Lucene index
    matrixdump: : Dump matrix in CSV format
    matrixmult: : Take the product of two matrices
    parallelALS: : ALS-WR factorization of a rating matrix
    qualcluster: : Runs clustering experiments and summarizes results in a CSV
    recommendfactorized: : Compute recommendations using the factorization of a rating matrix
    recommenditembased: : Compute recommendations using item-based collaborative filtering
    regexconverter: : Convert text files on a per line basis based on regular expressions
    resplit: : Splits a set of SequenceFiles into a number of equal splits
    rowid: : Map SequenceFile to {SequenceFile, SequenceFile}
    rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
    runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
    runlogistic: : Run a logistic regression model against CSV data
    seq2encoded: : Encoded Sparse Vector generation from Text sequence files
    seq2sparse: : Sparse Vector generation from Text sequence files
    seqdirectory: : Generate sequence files (of Text) from a directory
    seqdumper: : Generic Sequence File dumper
    seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
    seqwiki: : Wikipedia xml dump to sequence file
    spectralkmeans: : Spectral k-means clustering
    split: : Split Input data into test and train sets
    splitDataset: : split a rating dataset into training and probe parts
    ssvd: : Stochastic SVD
    streamingkmeans: : Streaming k-means clustering
    svd: : Lanczos Singular Value Decomposition
    testforest: : Test the random forest classifier
    testnb: : Test the Vector-based Bayes classifier
    trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
    trainlogistic: : Train a logistic regression using stochastic gradient descent
    trainnb: : Train the Vector-based Bayes classifier
    transpose: : Take the transpose of a matrix
    validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
    vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
    vectordump: : Dump vectors from a sequence file to text
    viterbi: : Viterbi decoding of hidden states from given output states sequence
    19/05/16 13:28:15 WARN util.ShutdownHookManager: ShutdownHook '' timeout, java.util.concurrent.TimeoutException
    java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
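
    Any name in the list above can be passed to mahout as its first argument. As a quick sanity check (a minimal sketch; Mahout 0.9's drivers are built on AbstractJob, which accepts a --help flag, though the exact usage text can vary by version), asking one of the programs for its option summary confirms the client dispatches correctly:

    [root@master ~]# mahout kmeans --help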

  2. Use the Mahout tool to convert the extracted contents of 20news-bydate.tar.gz into sequence files, save them under the /data/mahout/20news/output/20news-seq/ directory, and list the contents of that directory.
    [root@master ~]# mkdir 20new
    [root@master ~]# tar -zxf /opt/20news-bydate.tar.gz -C 20new/
    [root@master ~]# hadoop fs -mkdir -p /data/mahout/20news/20news-all
    [root@master ~]# hadoop fs -put 20new/* /data/mahout/20news/20news-all
    [hdfs@master ~]$ mahout seqdirectory -i /data/mahout/20news/20news-all -o /data/mahout/20news/output/20news-seq

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hdp/2.6.1.0-129/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hdp/2.6.1.0-129/hadoop/conf
MAHOUT-JOB: /usr/hdp/2.6.1.0-129/mahout/mahout-examples-0.9.0.2.6.1.0-129-job.jar
19/05/21 15:15:26 WARN driver.MahoutDriver: No seqdirectory.props found on classpath, will use command-line arguments only
19/05/21 15:16:25 INFO mapreduce.JobSubmitter: number of splits:1
19/05/21 15:16:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1558132743195_0001
19/05/21 15:16:29 INFO impl.YarnClientImpl: Application submission is not finished, submitted application application_1558132743195_0001 is still in NEW_SAVING
19/05/21 15:16:31 INFO impl.YarnClientImpl: Submitted application application_1558132743195_0001
19/05/21 15:16:31 INFO mapreduce.Job: The url to track the job: http://slaver1.hadoop:8088/proxy/application_1558132743195_0001/
19/05/21 15:16:31 INFO mapreduce.Job: Running job: job_1558132743195_0001
19/05/21 15:16:52 INFO mapreduce.Job: Job job_1558132743195_0001 running in uber mode : false
19/05/21 15:16:52 INFO mapreduce.Job: map 0% reduce 0%
19/05/21 15:17:03 INFO mapreduce.Job: map 7% reduce 0%
19/05/21 15:17:06 INFO mapreduce.Job: map 12% reduce 0%
19/05/21 15:17:09 INFO mapreduce.Job: map 16% reduce 0%
19/05/21 15:17:12 INFO mapreduce.Job: map 20% reduce 0%
19/05/21 15:17:14 INFO mapreduce.Job: map 22% reduce 0%
19/05/21 15:17:17 INFO mapreduce.Job: map 25% reduce 0%
19/05/21 15:17:20 INFO mapreduce.Job: map 32% reduce 0%
19/05/21 15:17:23 INFO mapreduce.Job: map 39% reduce 0%
19/05/21 15:17:26 INFO mapreduce.Job: map 47% reduce 0%
19/05/21 15:17:29 INFO mapreduce.Job: map 55% reduce 0%
19/05/21 15:17:32 INFO mapreduce.Job: map 61% reduce 0%
19/05/21 15:17:35 INFO mapreduce.Job: map 67% reduce 0%
19/05/21 15:17:38 INFO mapreduce.Job: map 75% reduce 0%
19/05/21 15:17:41 INFO mapreduce.Job: map 83% reduce 0%
19/05/21 15:17:44 INFO mapreduce.Job: map 92% reduce 0%
19/05/21 15:17:47 INFO mapreduce.Job: map 100% reduce 0%
19/05/21 15:17:47 INFO mapreduce.Job: Job job_1558132743195_0001 completed successfully
19/05/21 15:17:48 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=151414
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=35728752
HDFS: Number of bytes written=12816846
HDFS: Number of read operations=71984
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=107006
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=53503
Total vcore-milliseconds taken by all map tasks=53503
Total megabyte-milliseconds taken by all map tasks=82180608
Map-Reduce Framework
Map input records=17995
Map output records=17995
Input split bytes=2055363
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=358
CPU time spent (ms)=39420
Physical memory (bytes) snapshot=335376384
Virtual memory (bytes) snapshot=3257303040
Total committed heap usage (bytes)=176160768
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=12816846
19/05/21 15:17:48 INFO driver.MahoutDriver: Program took 138608 ms (Minutes: 2.3101333333333334)
[hdfs@master ~]$ hadoop fs -ls /data/mahout/20news/output/20news-seq
Found 2 items
-rw-r--r--   3 hdfs hdfs        0 2019-05-21 15:17 /data/mahout/20news/output/20news-seq/_SUCCESS
-rw-r--r--   3 hdfs hdfs 12816846 2019-05-21 15:17 /data/mahout/20news/output/20news-seq/part-m-00000
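
The part-m-00000 file is a Hadoop SequenceFile whose keys are the relative source-file paths and whose values are the raw document text (this is visible in the -text output of the next exercise). As a quick sketch of verifying the record count with the seqdumper program listed in Exercise 1 (the -c/--count flag is assumed from Mahout 0.9's SequenceFileDumper options):

[hdfs@master ~]$ mahout seqdumper -i /data/mahout/20news/output/20news-seq/part-m-00000 -c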

  3. Use the Mahout tool to convert the extracted contents of 20news-bydate.tar.gz into sequence files, save them under the /data/mahout/20news/output/20news-seq/ directory, and view the sequence file contents with the -text command (the first 20 lines are sufficient).
    [root@master ~]# mkdir 20new
    [root@master ~]# tar -zxf /opt/20news-bydate.tar.gz -C 20new/
    [root@master ~]# hadoop fs -mkdir -p /data/mahout/20news/20news-all
    [root@master ~]# hadoop fs -put 20new/* /data/mahout/20news/20news-all
    [hdfs@master ~]$ mahout seqdirectory -i /data/mahout/20news/20news-all -o /data/mahout/20news/output/20news-seq
    (The seqdirectory job log is identical to that shown in Exercise 2 above.)
    [hdfs@master ~]$ hadoop fs -text /data/mahout/20news/output/20news-seq/part-m-00000 | head -n 20
    19/05/21 15:25:09 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    19/05/21 15:25:09 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
    19/05/21 15:25:09 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
    19/05/21 15:25:09 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
    19/05/21 15:25:09 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
    /20news-bydate-test/alt.atheism/53068 From: decay@cbnewsj.cb.att.com (dean.kaflowitz)
    Subject: Re: about the bible quiz answers
    Organization: AT&T
    Distribution: na
    Lines: 18

    In article , healta@saturn.wwc.edu (Tammy R Healy) writes:
    >
    >
    > #12) The 2 cheribums are on the Ark of the Covenant. When God said make no
    > graven image, he was refering to idols, which were created to be worshipped.
    > The Ark of the Covenant wasn't wrodhipped and only the high priest could
    > enter the Holy of Holies where it was kept once a year, on the Day of
    > Atonement.

    I am not familiar with, or knowledgeable about the original language,
    but I believe there is a word for "idol" and that the translator
    would have used the word "idol" instead of "graven image" had
    the original said "idol." So I think you're wrong here, but
    then again I could be too. I just suggesting a way to determine
    text: Unable to write to output stream.
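
    (The trailing "text: Unable to write to output stream." message is expected here: head exits after printing 20 lines, closing the pipe, and hadoop fs -text reports the resulting broken pipe.)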

  4. Use the Mahout mining tool to produce item recommendations from the dataset user-item-score.txt (user, item, score triples). Use the item-based collaborative filtering algorithm with the Euclidean distance similarity measure, recommend 3 items per user, treat the data as non-Boolean, set the maximum number of preferences to 4 and the minimum to 1, save the recommendation output to the output directory, and view the contents of part-r-00000 with the -cat command.
    [root@master ~]# hadoop fs -mkdir -p /data/mahout/project
    [root@master ~]# hadoop fs -put /opt/user-item-score.txt /data/mahout/project
    [root@master ~]# mahout recommenditembased -i /data/mahout/project/user-item-score.txt -o /data/mahout/project/output -n 3 -b false -s SIMILARITY_EUCLIDEAN_DISTANCE --maxPrefsPerUser 4 --minPrefsPerUser 1 --maxPrefsInItemSimilarity 4 --tempDir /data/mahout/project/temp
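
    Mahout's recommender expects the input as one userID,itemID,score triple per line, comma-separated; a sketch of what user-item-score.txt presumably contains (the IDs and scores here are made up for illustration):

    1,101,4
    1,102,3
    2,101,2

    Once the job finishes, the recommendations can be read as the question asks; each output line pairs a user with its top-3 items in the form userID [itemID:estimatedScore,...]:

    [root@master ~]# hadoop fs -cat /data/mahout/project/output/part-r-00000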
