# I. Oversampling and the SMOTE Algorithm

## 1. Oversampling

``````python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Build an imbalanced 3-class toy dataset: roughly 1% / 5% / 94% of 5000 samples.
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

ros = RandomOverSampler(random_state=0)
print("Sample counts before resampling:")
print(sorted(Counter(y).items()))

# Randomly duplicate minority-class samples until every class matches the majority.
X_resampled, y_resampled = ros.fit_resample(X, y)
print("Sample counts after resampling:")
print(sorted(Counter(y_resampled).items()))
``````

``````
Sample counts before resampling:
[(0, 64), (1, 262), (2, 4674)]
Sample counts after resampling:
[(0, 4674), (1, 4674), (2, 4674)]
``````

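`X` is a two-dimensional array holding the features of each sample (here 5000 rows by 2 columns), and `y` holds each sample's class label. The generated dataset has three classes: 0, 1, and 2. Class 2 is the majority class with 4674 samples, while classes 0 and 1 have only 64 and 262. Random oversampling draws existing samples of classes 0 and 1 with replacement, that is, it duplicates them, until each class has as many samples as the majority class 2.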
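To make the duplication step concrete, here is a minimal pure-Python sketch of random oversampling. The function `random_oversample` and the toy data are made up for illustration. This is not how `RandomOverSampler` is implemented internally, only the same idea in a few lines.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Copy randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the existing rows that belong to this class.
        pool = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement: every added row is a copy of an existing one.
        for _ in range(target - count):
            i = rng.choice(pool)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: one sample each of classes 0 and 1, three of class 2.
X_toy = [[0.0, 0.1], [0.2, 0.3], [1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]
y_toy = [0, 1, 2, 2, 2]
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(sorted(Counter(y_bal).items()))  # [(0, 3), (1, 3), (2, 3)]
```

Every row in `X_bal` already appears in `X_toy`. No new feature values are created, which is the main difference from SMOTE.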

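Because `RandomOverSampler` only copies rows, it also accepts heterogeneous data, for example an array that mixes strings and numbers:

``````python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
# Mixed-type features: a string, an int and a float in each row.
X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
                    dtype=object)
y_hetero = np.array([0, 0, 1])
X_resampled, y_resampled = ros.fit_resample(X_hetero, y_hetero)
print(X_resampled)
print(y_resampled)
``````

``````
[['xxx' 1 1.0]
 ['yyy' 2 2.0]
 ['zzz' 3 3.0]
 ['zzz' 3 3.0]]
[0 0 1 1]
``````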

## 2. The SMOTE Algorithm

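``````python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Same imbalanced 3-class toy dataset as in the oversampling example.
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print("Sample counts before resampling:")
print(Counter(y))

# Target number of samples per class after resampling; class 2 keeps its 4674.
sampling_strategy = {0: 3000, 1: 3000, 2: 4674}
# n_jobs=1 from the original call is omitted: it is deprecated on SMOTE
# in recent imbalanced-learn releases.
oversampler = SMOTE(sampling_strategy=sampling_strategy, random_state=0,
                    k_neighbors=2)

X_resampled, y_resampled = oversampler.fit_resample(X, y)
print("Sample counts after resampling:")
print(Counter(y_resampled))
``````

``````
Sample counts before resampling:
Counter({2: 4674, 1: 262, 0: 64})
Sample counts after resampling:
Counter({2: 4674, 1: 3000, 0: 3000})
``````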
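Unlike random oversampling, SMOTE creates new points by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows that core step in plain Python. The function `smote_sketch` and the toy points are assumptions made for illustration, and the real `SMOTE` class differs in details such as how it distributes new samples across base points.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_toy, n_new=6, k=2)
print(len(new_points))  # 6
```

With `k=2`, each corner's two nearest neighbours are the adjacent corners, so every synthetic point falls on an edge of the square. The `k` argument plays the same role as `k_neighbors=2` in the call above.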

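Instead of a dict, `sampling_strategy` can also be a string that names which classes to resample. The values below are the ones listed for under-samplers:

- `'majority'`: resample only the majority class.
- `'not minority'`: resample all classes except the minority class.
- `'not majority'`: resample all classes except the majority class.
- `'all'`: resample all classes.
- `'auto'`: equivalent to `'not minority'`.

For over-samplers such as SMOTE, `'majority'` is replaced by `'minority'` (resample only the minority class) and `'auto'` is equivalent to `'not majority'`.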
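The following sketch shows how such a string could be turned into target class counts. It assumes over-sampler semantics, where `'auto'` means `'not majority'` and `'minority'` takes the place of `'majority'`. The helper `resolve_oversampling_strategy` is hypothetical and not part of imbalanced-learn.

```python
from collections import Counter


def resolve_oversampling_strategy(y, strategy="auto"):
    """Return {class: target count} for a string strategy (hypothetical helper)."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if strategy == "auto":
        strategy = "not majority"  # assumed default for over-samplers
    if strategy == "minority":
        selected = [minority]
    elif strategy == "not minority":
        selected = [c for c in counts if c != minority]
    elif strategy == "not majority":
        selected = [c for c in counts if c != majority]
    elif strategy == "all":
        selected = list(counts)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Selected classes are raised to the majority count; the rest keep their size.
    return {c: counts[majority] if c in selected else n
            for c, n in sorted(counts.items())}


# Class sizes taken from the example output above.
y_toy = [0] * 64 + [1] * 262 + [2] * 4674
print(resolve_oversampling_strategy(y_toy, "minority"))      # {0: 4674, 1: 262, 2: 4674}
print(resolve_oversampling_strategy(y_toy, "not majority"))  # {0: 4674, 1: 4674, 2: 4674}
```

A dict such as `{0: 3000, 1: 3000, 2: 4674}` gives finer control than any of the strings, because it sets a separate target for each class.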

# II. Using a Real Dataset (Something Like a Text Classification Task?)

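``````python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"ID": ["a", "b", "a", "c"]})  # placeholder data (assumption)
index = 0  # starting offset (assumption; not defined in the original snippet)
for col in ["ID"]:
    le = LabelEncoder()
    le.fit(df[col])
    row = le.transform(df[col])  # map each category to an integer 0..n-1
    df[col] = row + index        # shift the codes by the current offset
    index = max(row) + 1         # next offset starts after the largest code
``````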

# References

https://www.jianshu.com/p/3eac447b7261