# How to Make Money Trading Stocks with Python and Machine Learning?

"A Sunday New York Times article on the potential of a new cancer-curing drug sent EntreMed's stock price soaring from 12.063 at Friday's close to 85, before closing near 52 on Monday. Over the following three weeks it kept closing above 30. The investment enthusiasm also earned other biotechnology stocks a premium. Yet this potential breakthrough in cancer research had been reported at least five months earlier in the journal Nature and in various popular newspapers, including the Times itself! So enthusiastic public attention alone can trigger a sustained rise in stock prices, even when no genuinely new information has appeared."

"(Stock price) movements may concentrate in stocks that have something in common, but that something need not be economic fundamentals."

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sb

np.seterr(divide='ignore', invalid='ignore')

# Quick way to test just a few column features
# stocks = pd.read_csv('supercolumns-elements-nasdaq-nyse-otcbb-general-UPDATE-2017-03-01.csv', usecols=range(1,16))
stocks = pd.read_csv('supercolumns-elements-nasdaq-nyse-otcbb-general-UPDATE-2017-03-01.csv')

# Collect the names of the string-valued columns
# (iteritems() was removed in pandas 2.0; items() is the replacement)
str_list = []
for colname, colvalue in stocks.items():
    if type(colvalue[1]) == str:
        str_list.append(colname)
# Get to the numeric columns by inversion
num_list = stocks.columns.difference(str_list)

stocks_num = stocks[num_list]
```

```python
stocks_num = stocks_num.fillna(value=0)

X = stocks_num.values
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

f, ax = plt.subplots(figsize=(12, 10))
plt.title('Pearson Correlation of Concept Features (Elements & Minerals)')
# Draw the heatmap using seaborn
sb.heatmap(stocks_num.astype(float).corr(), linewidths=0.25, vmax=1.0,
           square=True, cmap="YlGnBu", linecolor='black', annot=True)
plt.show()
```
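The heatmap shows every pairwise Pearson correlation as a color. If you also want the strongest pairs as numbers, one approach is to rank the upper triangle of the correlation matrix. A minimal sketch, using a small toy frame standing in for `stocks_num` (the column names `a`, `b`, `c` are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy numeric frame standing in for stocks_num (illustrative data)
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.1, 6.0, 8.2],  # nearly proportional to 'a'
    'c': [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr()
# Keep only the upper triangle so each pair appears once, then rank
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head(1))  # the most correlated pair, here ('a', 'b')
```

On the real dataset this surfaces the feature pairs worth a closer look before feeding everything into PCA.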

## Measuring Explained Variance and Principal Component Analysis (PCA)

```python
# Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
# Sort from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Calculation of explained variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)  # Cumulative explained variance

# Variances plot
max_cols = len(var_exp)  # one bar per numeric feature column
plt.figure(figsize=(10, 5))
plt.bar(range(max_cols), var_exp, alpha=0.3333, align='center',
        label='individual explained variance', color='g')
plt.step(range(max_cols), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()
```
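Instead of eyeballing the cumulative curve, scikit-learn's `PCA` also accepts a float in (0, 1) for `n_components`, in which case it keeps just enough components to reach that fraction of explained variance. A minimal sketch on synthetic data (a random matrix standing in for `X_std`, not the stocks dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for X_std

# A float n_components keeps the smallest number of components
# whose cumulative explained-variance ratio reaches 90%
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

This turns the "how many components?" decision into a single threshold parameter rather than a manual read of the plot.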

```python
pca = PCA(n_components=9)
x_9d = pca.fit_transform(X_std)

plt.figure(figsize=(9, 7))
plt.scatter(x_9d[:, 0], x_9d[:, 1], c='goldenrod', alpha=0.5)
plt.ylim(-10, 30)
plt.show()
```

## K-Means Clustering

```python
# Set up a KMeans clustering with 3 clusters
kmeans = KMeans(n_clusters=3)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0: 'r', 1: 'g', 2: 'b'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize=(7, 7))
plt.scatter(x_9d[:, 0], x_9d[:, 2], c=label_color, alpha=0.5)
plt.show()
```
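The choice of `n_clusters=3` above is ad hoc. One common sanity check is the elbow method: compute the KMeans inertia (within-cluster sum of squares) for a range of k and look for the point where the curve flattens. A minimal sketch on synthetic blobs (illustrative data, not the stock features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia for k = 1..6; the "elbow" where the drop levels off
# suggests a reasonable number of clusters
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 7)]
for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

On this toy data the inertia falls steeply until k = 3 and only marginally afterwards, which is the elbow pattern to look for on the PCA-projected stock features as well.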

```python
# Create a temp dataframe from our PCA projection data "x_9d"
df = pd.DataFrame(x_9d)
df = df[[0, 1, 2]]
df['X_cluster'] = X_clustered

# Call seaborn's pairplot to visualize our KMeans clustering
# on the PCA-projected data (the old size= parameter is now height=)
sb.pairplot(df, hue='X_cluster', palette='Dark2', diag_kind='kde', height=1.85)
plt.show()
```