scikit-learn Study Notes 1: Loading a Text Corpus

Start with the official documentation (user guide):
http://scikit-learn.org/stable/user_guide.html
API:
http://scikit-learn.org/stable/modules/classes.html

The function for loading a text corpus is load_files; its documentation is at:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files

Corpus directory structure (one subfolder per category, and each subfolder name is used as the label name):

container_folder/
    category_1_folder/
        file_1.txt file_2.txt ... file_42.txt
    category_2_folder/
        file_43.txt file_44.txt ...
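
To make the layout concrete, here is a minimal sketch that builds a tiny corpus in a temporary directory and lists the category folders. It uses only the standard library; the helper name make_toy_corpus and the file contents are made up for illustration, and the folder names simply follow the diagram above.

```python
import tempfile
from pathlib import Path


def make_toy_corpus(root):
    """Create a tiny two-category corpus under `root` (hypothetical helper)."""
    layout = {
        "category_1_folder": {"file_1.txt": "first document",
                              "file_2.txt": "second document"},
        "category_2_folder": {"file_43.txt": "third document"},
    }
    for folder, files in layout.items():
        folder_path = Path(root) / folder
        folder_path.mkdir(parents=True, exist_ok=True)
        for name, text in files.items():
            (folder_path / name).write_text(text, encoding="utf-8")
    return Path(root)


with tempfile.TemporaryDirectory() as tmp:
    container = make_toy_corpus(tmp)
    # Each subdirectory name plays the role of a category label.
    categories = sorted(p.name for p in container.iterdir() if p.is_dir())
    # Number of documents found in each category folder.
    counts = {c: len(list((container / c).iterdir())) for c in categories}

print(categories)
print(counts)
```

A directory built this way has the shape that load_files expects as its container_path argument.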

Source code walkthrough:

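from os import listdir
from os.path import isdir, join

import numpy as np
from sklearn.utils import Bunch, check_random_state


def load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0):
    # Label index of each document
    target = []
    # Label names
    target_names = []
    # Paths of the corpus files
    filenames = []
    # Collect the names of all subdirectories; these are the label names
    folders = [f for f in sorted(listdir(container_path))
               if isdir(join(container_path, f))]
    # Restrict loading to the requested categories
    if categories is not None:
        folders = [f for f in folders if f in categories]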
    # Fill target_names, target and filenames, using one integer label per folder
    for label, folder in enumerate(folders):
        target_names.append(folder)
        folder_path = join(container_path, folder)
        documents = [join(folder_path, d)
                     for d in sorted(listdir(folder_path))]
        target.extend(len(documents) * [label])
        filenames.extend(documents)

    # Convert to arrays so they can be reordered with fancy indexing
    filenames = np.array(filenames)
    target = np.array(target)

    # Optionally shuffle; filenames and target share the same permutation, so they stay aligned
    if shuffle:
        random_state = check_random_state(random_state)
        indices = np.arange(filenames.shape[0])
        random_state.shuffle(indices)
        filenames = filenames[indices]
        target = target[indices]
    # Optionally load file contents; files are read as bytes and decoded only if an encoding is given
    if load_content:
        data = []
        for filename in filenames:
            with open(filename, 'rb') as f:
                data.append(f.read())
        if encoding is not None:
            data = [d.decode(encoding, decode_error) for d in data]
        return Bunch(data=data,
                     filenames=filenames,
                     target_names=target_names,
                     target=target,
                     DESCR=description)
    # With load_content=False, the returned Bunch has no data field
    return Bunch(filenames=filenames,
                 target_names=target_names,
                 target=target,
                 DESCR=description)
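
As a quick check of the behaviour read off the source above, the following sketch calls sklearn.datasets.load_files on a throwaway corpus. The folder names (neg, pos), file names and texts are invented for illustration, and shuffle=False is passed only so that the order is easy to follow.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical toy corpus: two categories, three small UTF-8 files.
    for folder, name, text in [
        ("neg", "a.txt", "bad film"),
        ("pos", "b.txt", "good film"),
        ("pos", "c.txt", "great film"),
    ]:
        folder_path = Path(tmp) / folder
        folder_path.mkdir(exist_ok=True)
        (folder_path / name).write_text(text, encoding="utf-8")

    # With an encoding, data holds decoded str objects.
    text_bunch = load_files(tmp, encoding="utf-8", shuffle=False)
    # Without an encoding, data holds the raw bytes of each file.
    bytes_bunch = load_files(tmp, shuffle=False)
    # With load_content=False, only paths and labels are returned.
    paths_only = load_files(tmp, load_content=False, shuffle=False)

print(text_bunch.target_names)              # label names, taken from the folder names
print(text_bunch.target.tolist())           # label index of each document
print(text_bunch.data)                      # decoded document texts
print(type(bytes_bunch.data[0]).__name__)   # type of an undecoded document
print("data" in paths_only)                 # whether the Bunch carries file contents
```

The three calls correspond to the three paths through the function: decoding when encoding is set, raw bytes when it is not, and no data field at all when load_content is False.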
