# Problem Statement

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split first, then fit the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scale first on the full dataset, then split
sc = StandardScaler()
X_transform = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.1)
```

# Argument

#### Splitting the data before feature scaling vs. scaling before splitting

1. First call fit() to learn the relevant parameter values (think of this as learning the feature-scaling rule)
2. Then call transform() to apply the conversion

The fit_transform method simply runs fit() followed by transform(), so every call learns a new feature-scaling rule. We can apply the training set's scaling rule to the test set, or the test set's rule to the training set (though the latter is rarely done), but a scaling rule fitted on the full dataset (training set + test set) is meaningless for model training.
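As a minimal sketch (with a toy array standing in for the training set), the equivalence of fit_transform and fit() followed by transform(), and the reuse of a fitted rule on unseen data, looks like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the training set (illustrative values only)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

# fit_transform is equivalent to calling fit() and then transform()
sc = StandardScaler()
X_a = sc.fit_transform(X_train)

sc2 = StandardScaler()
sc2.fit(X_train)            # learn mean and std from the training set
X_b = sc2.transform(X_train)

print(np.allclose(X_a, X_b))   # True: the two call patterns match

# The fitted rule (mean_, scale_) can then be applied to unseen data
X_new = np.array([[5.0]])
print(sc.transform(X_new))     # scaled with the *training* mean and std
```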

• Splitting first, then scaling: fit the feature-scaling rule on the old-flower data, use it to transform the new-flower data, and then classify the new-flower data transformed under the old-flower rule;
• Scaling first, then splitting: merge the new and old flower data into one combined dataset, run fit_transform on the whole thing, and finally split the new-flower data back out for prediction;
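The leakage in the second ordering shows up directly in the fitted statistics. A minimal sketch (the toy data and random_state are chosen here purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Toy data: 10 evenly spaced samples (illustrative only)
X = np.arange(10.0).reshape(-1, 1)
y = np.arange(10)

# Order 1: split first, then fit on the training portion only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
sc_split_first = StandardScaler().fit(X_tr)

# Order 2: fit on the full dataset, so the scaler "sees" the test row too
sc_scale_first = StandardScaler().fit(X)

# The learned means differ: the second rule has absorbed test-set statistics
print(sc_split_first.mean_[0])  # mean of the 9 training rows only
print(sc_scale_first.mean_[0])  # 4.5, the mean of all 10 rows
```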

# Summary

```python
>>> import numpy as np
>>> from sklearn import preprocessing
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
array([ 4.3,  5.1,  5.8,  6.5,  7.9])
```

MarketAxess · Quantitative Research Institute
Hi wengJJ, this is a popular topic under discussion. Indeed, separating the data first and then doing normalization is more reasonable. But sometimes we do use all samples to fit the Imputer, especially for some embedding tasks. In this case we can not only include more scenarios in training, but also inject some noise into training to avoid over-fitting. What do you think? Thanks, Xiang

Thank you very much for your message. I don't know much about embedding tasks, which is why I emphasized at the beginning of the article that "this article discusses the ordering of data splitting and feature scaling from the perspective of machine learning and data mining." Speaking only of sklearn's Imputer: the Imputer targets missing data, i.e. data that did not exist to begin with. That step creates data, unlike feature scaling, which modifies existing data. Created data can be shared by all, so it is acceptable to use the entire sample. Imputation belongs to data cleaning, and in the traditional data-mining workflow, data cleaning happens before feature scaling, so it does not conflict with my conclusion.
MarketAxess · Quantitative Research Institute

That makes sense! For embedding, I am referring to something like word2vec, where we usually include all the data when training the LabelEncoder so that every word is considered for embedding. But yes, it doesn't conflict with your theory.
What exactly does "pipeline" refer to in machine learning? I've just entered this field and never came across the term in textbooks. Could someone knowledgeable explain? Thanks.