Prediction of Red Wine Quality

Robin Dong 2018-08-10 14:22

Kaggle hosts an example dataset on the quality of red wine. I wrote some code for it using scikit-learn and pandas:

import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Read dataset
wine = pd.read_csv('~/Downloads/winequality-red.csv', sep = ';')
attrs = wine.drop(['quality'], axis = 1)
header = list(attrs)
attrs = attrs.values

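# Use scaler to normalize data
# (note: scaled_attrs is computed here but not used below; both classifiers are fit on the raw attrs)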
scaler = StandardScaler()
scaled_attrs = scaler.fit_transform(attrs)

quality = wine['quality'].values

# SVM classifier
svr = SVC(kernel = 'rbf', max_iter = -1)
svr.fit(attrs, quality)

# Randomized decision trees classifier
dt = ExtraTreesClassifier()
dt.fit(attrs, quality)

ls = list(zip(dt.feature_importances_, header))
ls.sort(key = lambda x: x[1])
for importance, name in ls:
    print(name, importance)

print('\n\n')

# Cross validation on these two classifiers
for reg in [svr, dt]:
    scores = cross_val_score(reg, attrs, quality, scoring = 'neg_mean_squared_error', cv = 10)
    # Scores are negated MSE, so flipping the sign gives the mean squared error (not its root)
    mse = -scores
    print(reg)
    print(mse.mean(), mse.std())
    print('\n')

The results reported by the snippet above:

alcohol 0.1438906634767823
chlorides 0.07953780339531004
citric acid 0.07979101058207233
density 0.0846765183778148
fixed acidity 0.07686725880938272
free sulfur dioxide 0.07178658192019563
pH 0.07797509374376276
residual sugar 0.0796105749270121
sulphates 0.11872569296381115
total sulfur dioxide 0.0993798893196299
volatile acidity 0.08775891248422625



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.6983420378445301 0.04803296683789781


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
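
The script fits both classifiers on the raw attrs, and the numbers it prints are mean squared errors of the predicted labels rather than root mean squared errors. Below is a minimal sketch of one way to standardize the features inside cross-validation and report an actual RMSE. The function name cv_rmse_scaled_svc is hypothetical, and wrapping the scaler in a pipeline is my assumption about what was intended, not something the original script does.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cv_rmse_scaled_svc(X, y, cv=10):
    """Return the per-fold RMSE of an RBF SVC trained on standardized features.

    Hypothetical helper: the original script fits SVC on unscaled data.
    """
    # Inside a pipeline the scaler is refit on each training fold,
    # so statistics from the held-out fold never reach the model.
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    neg_mse = cross_val_score(model, X, y,
                              scoring='neg_mean_squared_error', cv=cv)
    # The scorer returns negated MSE; flip the sign, then take the root.
    return np.sqrt(-neg_mse)
```

With the arrays from the script, cv_rmse_scaled_svc(attrs, quality) should return ten per-fold values whose mean and standard deviation can be printed the same way as before. I have not rerun the experiment, so no numbers are quoted here.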

It looks like the most important feature for predicting the quality of red wine is ‘alcohol’. Intuitive, right?
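
The script prints the importances in alphabetical order of feature name, and the classifier output shows n_estimators=10 with random_state=None, so the exact values can shift between runs. Here is a rough sketch of how the ranking could be made easier to read and repeatable. The helper name ranked_importances and the choices n_estimators=100 and random_state=0 are my own assumptions, not settings from the original run.

```python
from sklearn.ensemble import ExtraTreesClassifier


def ranked_importances(X, y, names, n_estimators=100, random_state=0):
    """Fit extremely randomized trees and return (name, importance) pairs,
    ordered from most to least important.

    Hypothetical helper; the defaults here are illustrative assumptions.
    """
    forest = ExtraTreesClassifier(n_estimators=n_estimators,
                                  random_state=random_state)
    forest.fit(X, y)
    pairs = zip(names, forest.feature_importances_)
    # Sort on the importance value (second element), largest first.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Calling ranked_importances(attrs, quality, header) with the variables from the script would give a list that can be printed in a loop, with the top-ranked feature first.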
