Official documentation for the sklearn datasets module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
Overview of sklearn datasets
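| Type | How to obtain |
|---|---|
| Random datasets generated by sklearn | sklearn.datasets.make_… |
| Datasets bundled with sklearn | sklearn.datasets.load_… |
| Datasets downloaded online by sklearn | sklearn.datasets.fetch_… |
| Datasets in svmlight format loaded by sklearn | sklearn.datasets.load_svmlight_file(…) |
| Datasets downloaded by sklearn from mldata.org | sklearn.datasets.fetch_mldata(…) (removed in scikit-learn 0.22; fetch_openml is the replacement) |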
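To make the loader families listed above more concrete, here is a minimal sketch. It loads one bundled dataset with load_iris and round-trips a tiny array through the svmlight format with dump_svmlight_file and load_svmlight_file. The toy arrays and the temporary file name are made up for illustration, and the fetch_… loaders are left out because they need network access.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_iris, load_svmlight_file

# Bundled dataset: load_* returns a Bunch with .data and .target
iris = load_iris()
print(iris.data.shape, iris.target.shape)

# Toy data (hypothetical), written to and read back from svmlight format
X_toy = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
y_toy = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.svmlight")
    dump_svmlight_file(X_toy, y_toy, path)
    # load_svmlight_file returns a SciPy sparse matrix and a label array
    X_loaded, y_loaded = load_svmlight_file(path, n_features=3)

print(X_loaded.toarray())
print(y_loaded)
```

Passing n_features explicitly keeps the column count stable even when trailing columns contain only zeros, which the sparse svmlight format does not store.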
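By changing the parameters of sklearn's random data generators you can produce an unlimited supply of data, and the number of samples, the number of features, the number of label classes and the noise level can all be customized, which makes them very flexible. The generators below are the ones used most often.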
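| Function | Use |
|---|---|
| make_classification() | Classification |
| make_multilabel_classification() | Multilabel classification |
| make_regression() | Regression |
| make_blobs() | Clustering and classification |
| make_circles() | Classification |
| make_moons() | Classification |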
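As a quick illustration of three of the generators listed above, the following sketch calls make_blobs, make_circles and make_moons and prints the shape of each result with its label values. The sample counts, noise levels and random_state values are arbitrary choices for the example.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

# Three isotropic Gaussian clusters in two dimensions
X_blobs, y_blobs = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

# Two concentric circles; factor scales the inner circle relative to the outer one
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

# Two interleaving half circles
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

generated = [("blobs", X_blobs, y_blobs),
             ("circles", X_circles, y_circles),
             ("moons", X_moons, y_moons)]
for name, X, y in generated:
    print(name, X.shape, sorted(set(y.tolist())))
```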
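| Parameter (make_classification) | Explanation |
|---|---|
| n_features | Total number of features: n_informative + n_redundant + n_repeated, plus any remaining features, which are filled with random noise |
| n_informative | Number of informative features |
| n_redundant | Number of redundant features, generated as random linear combinations of the informative features |
| n_repeated | Number of duplicated features, drawn at random from the informative and redundant features |
| n_classes | Number of classes |
| n_clusters_per_class | Number of clusters that make up each class |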
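The feature breakdown described above can be checked directly. The sketch below assumes shuffle=False, which keeps the columns in generation order (informative, redundant, repeated, then noise). The specific counts are arbitrary example values.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant, n_repeated = 3, 2, 1

# shuffle=False keeps the columns in generation order:
# informative, redundant, repeated, then random noise features
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=n_informative,
    n_redundant=n_redundant, n_repeated=n_repeated,
    n_classes=2, n_clusters_per_class=2, shuffle=False, random_state=0)

n_useful = n_informative + n_redundant
useful = X[:, :n_useful]
repeated = X[:, n_useful:n_useful + n_repeated]

# Redundant columns are linear combinations of the informative ones,
# so the informative + redundant block only spans n_informative dimensions
rank = np.linalg.matrix_rank(useful)

# The repeated column is a copy of one informative or redundant column
is_copy = any(np.array_equal(repeated[:, 0], useful[:, j]) for j in range(n_useful))

# Whatever is left over is random noise
n_noise = X.shape[1] - n_useful - n_repeated
print(X.shape, rank, is_copy, n_noise)
```

With the default shuffle=True the same columns exist but are permuted, so the slices above would no longer line up with the feature types.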
| import numpy as np |
| import pandas as pd |
| import matplotlib.pyplot as plt |
| from matplotlib.font_manager import FontProperties |
| from sklearn import datasets |
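| # Jupyter magic; drop this line when running as a plain script |
| %matplotlib inline |
| # macOS font path for CJK labels; adjust on other systems |
| font = FontProperties(fname='/Library/Fonts/Heiti.ttc') |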
| # 3 classes * 2 clusters per class = 6 clusters, but 2 informative features allow at most 2 ** 2 = 4 |
| try: |
|     X1, y1 = datasets.make_classification( |
|         n_samples=50, n_classes=3, n_clusters_per_class=2, n_informative=2) |
|     print(X1.shape) |
| except Exception as e: |
|     print('error:{}'.format(e)) |
| error:n_classes * n_clusters_per_class must be smaller or equal 2 ** n_informative |
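The error above comes from the constraint n_classes * n_clusters_per_class <= 2 ** n_informative. One way to satisfy it, sketched below, is to search for the smallest n_informative that fits and pass that value. The search loop is an illustration rather than part of sklearn, and the remaining parameters keep their defaults (20 features in total).

```python
from sklearn.datasets import make_classification

n_classes, n_clusters_per_class = 3, 2

# Smallest n_informative with n_classes * n_clusters_per_class <= 2 ** n_informative
n_informative = 1
while n_classes * n_clusters_per_class > 2 ** n_informative:
    n_informative += 1
print(n_informative)

X, y = make_classification(
    n_samples=50, n_classes=n_classes,
    n_clusters_per_class=n_clusters_per_class,
    n_informative=n_informative, random_state=0)
print(X.shape)
```

Alternatively, lowering n_clusters_per_class to 1 would also satisfy the constraint with only two informative features, since 3 * 1 <= 2 ** 2.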
| import matplotlib.pyplot as plt |
| %matplotlib inline |
| |
| # Four make_classification variants on a 2x2 grid; each uses two features so it can be plotted directly |
| plt.figure(figsize=(10, 10)) |
| |
| plt.subplot(221) |
| plt.title("One informative feature, one cluster per class", fontsize=12) |
| X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=1, |
|                                       n_clusters_per_class=1) |
| plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1) |
| |
| plt.subplot(222) |
| plt.title("Two informative features, one cluster per class", fontsize=12) |
| X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=2, |
|                                       n_clusters_per_class=1) |
| plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1) |
| |
| plt.subplot(223) |
| plt.title("Two informative features, two clusters per class", fontsize=12) |
| # n_clusters_per_class=2 is the default; it is spelled out here to match the title |
| X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=2, |
|                                       n_clusters_per_class=2) |
| plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1) |
| |
| plt.subplot(224) |
| plt.title("Multi-class, two informative features, one cluster", |
|           fontsize=12) |
| X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=2, |
|                                       n_clusters_per_class=1, n_classes=4) |
| plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1) |
| plt.show() |

| # Multilabel data: y1 is an indicator matrix with one column per class |
| X1, y1 = datasets.make_multilabel_classification( |
|     n_samples=1000, n_classes=4, n_features=2, random_state=1) |
| |
| print('Sample shape:{}'.format(X1.shape)) |
| # First five rows of the label indicator matrix |
| print(y1[0:5, :]) |
| Sample shape:(1000, 2) |
| [[1 1 0 0] |
| [0 0 0 0] |
| [1 1 0 0] |
| [0 0 0 1] |
| [0 0 0 0]] |
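| # y1 has shape (n_samples, 4); passed directly as c, each row would be read as an RGBA color, |
| # so color each point by its number of labels instead |
| plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1.sum(axis=1)) |
| plt.show() |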
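Because the multilabel target is an indicator matrix rather than a single label column, it can help to summarize it before plotting or training. The sketch below is one possible way to do that. It sets allow_unlabeled=False (an assumption for this example, not the setting used above) so that every sample carries at least one label.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(
    n_samples=1000, n_features=2, n_classes=4,
    allow_unlabeled=False, random_state=1)

# Number of labels attached to each sample
labels_per_sample = Y.sum(axis=1)

# Fraction of samples that carry each of the four labels
class_frequency = Y.mean(axis=0)

# Indicator rows converted to lists of label indices
first_rows = [np.flatnonzero(row).tolist() for row in Y[:5]]

print(Y.shape)
print(labels_per_sample.min(), labels_per_sample.max())
print(class_frequency)
print(first_rows)
```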
| import matplotlib.pyplot as plt |
| %matplotlib inline |
| from sklearn import datasets |
| X1, y1 = datasets.make_regression(n_samples=500, n_features=1, noise=20) |
| plt.scatter(X1, y1, color='r', s=10, marker='*') |
| plt.show() |
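To check that the generated regression data follows a known linear relationship, make_regression can also return the coefficients it used when coef=True is passed. The sketch below fits a LinearRegression model and compares the fitted slope with the true one. The random_state and the choice of model are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the coefficients of the underlying linear model
X, y, true_coef = make_regression(
    n_samples=500, n_features=1, noise=20, coef=True, random_state=0)

# With a single feature there is a single true coefficient
true_slope = float(np.ravel(true_coef)[0])

model = LinearRegression().fit(X, y)
fitted_slope = float(model.coef_[0])

print('true slope: {:.2f}'.format(true_slope))
print('fitted slope: {:.2f}'.format(fitted_slope))
```

The two slopes should be close but not identical, since noise=20 adds Gaussian noise with a standard deviation of 20 to the targets.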