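The data is a single CSV file that relates daily stock movement to the 25 top-ranked news headlines of each day. It contains the date, a label (0 or 1, where 0 means the index fell and 1 means it rose), and 25 columns of news headlines.
For processing, I first merge the 25 headline columns into one column, then split the rows by date into a training set and a test set, turn the text into features with TF-IDF, train a logistic regression model on the training set, and finally predict on the test set.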
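Before the Spark version, here is a minimal pure-Python sketch of what the TF-IDF step does, just to make the idea concrete. It is not Spark's implementation: `zlib.crc32` is a stand-in for the MurmurHash3 that Spark's HashingTF uses, `NUM_FEATURES = 16` is a deliberately tiny hypothetical value, the helper names are my own, and the headlines are made up. The IDF formula is the smoothed one Spark documents, log((n + 1) / (df + 1)).

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical, tiny on purpose; Spark's HashingTF defaults to 2**18


def hashing_tf(words, num_features=NUM_FEATURES):
    """Hashing trick: bucket index = hash(word) mod num_features, value = count."""
    # zlib.crc32 is only a deterministic stand-in for Spark's MurmurHash3.
    return dict(Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words))


def fit_idf(tf_vectors, num_features=NUM_FEATURES):
    """Smoothed IDF per bucket: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = [0] * num_features
    for tf in tf_vectors:
        for bucket in tf:  # each document counts once per bucket
            doc_freq[bucket] += 1
    return [math.log((n_docs + 1) / (df + 1)) for df in doc_freq]


def tf_idf(tf, idf):
    """Scale term frequencies by previously fitted IDF weights."""
    return {bucket: count * idf[bucket] for bucket, count in tf.items()}


# Made-up headlines, not rows from the CSV.
train_docs = ["stocks rally on earnings", "oil prices fall", "stocks fall on oil"]
train_tf = [hashing_tf(doc.lower().split()) for doc in train_docs]
idf = fit_idf(train_tf)  # fitted on the training documents only

test_tf = hashing_tf("stocks rally again".split())
test_vec = tf_idf(test_tf, idf)  # the test document reuses the training IDF
print(test_vec)
```

Two things to notice: different words can land in the same bucket (that is the price of hashing), and the IDF weights are learned once from the training documents and then applied to new text.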
The CSV has roughly the following columns: Date, Label, and Top1 through Top25.
OK, here is my code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.classification import LogisticRegression
spark = SparkSession \
    .builder \
    .appName("stock") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df = spark.read.option("header", "true").csv("E:\\web_file\\Stock\\Combined_News_DJIA.csv")
df.createOrReplaceTempView("stock")
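# Read the raw data, merge the 25 headline columns into one text column,
# and cast the label to an integer (training set: 2008-08-08 to 2014-12-31).
# concat_ws needs the separator ' ' as its first argument.
data = spark.sql("""
    SELECT int(Label) AS label,
           concat_ws(' ', Top1, Top2, Top3, Top4, Top5, Top6, Top7, Top8, Top9,
                     Top10, Top11, Top12, Top13, Top14, Top15, Top16, Top17,
                     Top18, Top19, Top20, Top21, Top22, Top23, Top24, Top25) AS term
    FROM stock
    WHERE Date >= '2008-08-08' AND Date <= '2014-12-31'
""")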
# Split the text into words
tokenizer = Tokenizer(inputCol="term", outputCol="words")
wordsData = tokenizer.transform(data)
# Compute a fixed-size term-frequency vector for each document using the hashing trick;
# each "document" must be an iterable sequence of terms
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featurizedData = hashingTF.transform(wordsData)
# Compute the IDF weights
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
# Cache the processed data
rescaledData.cache()
# Train the logistic regression model
lr = LogisticRegression(maxIter=100, regParam=0.01)
model = lr.fit(rescaledData)
# Build the test set (rows from 2015-01-02 onward);
# concat_ws needs the separator ' ' as its first argument.
test = spark.sql("""
    SELECT * FROM (
        SELECT int(Label) AS label,
               concat_ws(' ', Top1, Top2, Top3, Top4, Top5, Top6, Top7, Top8, Top9,
                         Top10, Top11, Top12, Top13, Top14, Top15, Top16, Top17,
                         Top18, Top19, Top20, Top21, Top22, Top23, Top24, Top25) AS term
        FROM stock
        WHERE Date >= '2015-01-02'
    ) WHERE term IS NOT NULL
""")
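# Apply the same transformations to the test set as to the training set.
# Reuse idfModel (fitted on the training data) instead of fitting a new IDF on the test set,
# so the test features are weighted the same way as the features the model was trained on.
wordsDataTest = tokenizer.transform(test)
featurizedDataTest = hashingTF.transform(wordsDataTest)
rescaledDataTest = idfModel.transform(featurizedDataTest)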
# Predict on the test set
prediction = model.transform(rescaledDataTest)
# Show the predictions as a crosstab (rows: actual label, columns: predicted label)
prediction.stat.crosstab("label", "prediction").show()
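The crosstab is just a 2x2 table of counts, so it is easy to turn into summary numbers. Here is a small sketch of how that could be done by hand. The helper `binary_metrics` is my own hypothetical function, not part of Spark, and the counts passed to it are toy numbers for illustration, not results from this dataset.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from the four cells of a 2x2 crosstab.

    tn: label 0 predicted 0    fp: label 0 predicted 1
    fn: label 1 predicted 0    tp: label 1 predicted 1
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy counts only, NOT the output of the model above.
metrics = binary_metrics(tn=3, fp=1, fn=2, tp=4)
print(metrics)
```

For an up/down label that is close to balanced, an accuracy near 0.5 means the model is doing no better than guessing, so it is worth reading precision and recall alongside it.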
Data source: https://www.kaggle.com/aaron7sun/stocknews