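Alibaba Tianchi Machine Learning (Data Analysis Expert Competition 3: Car Product Clustering Analysis)
Competition Background
The competition is framed as a competitor analysis: cars are grouped by clustering their data, so that for any given model, its competing models can be found through cluster analysis. It encourages learners to use the model data to build vehicle profiles that support data-driven decisions on product positioning and competitor analysis.
Competition Data
Data source: car_price.csv. The data covers 205 car models with 26 fields each.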
No. | Field | Description |
---|---|---|
1 | Car_ID | Unique id of each observation (Integer) |
2 | Symboling | Its assigned insurance risk rating. A value of +3 indicates that the auto is risky; -3 indicates that it is probably pretty safe. (Categorical) |
3 | carCompany | Name of car company (Categorical) |
4 | fueltype | Car fuel type i.e gas or diesel (Categorical) |
5 | aspiration | Aspiration used in a car (Categorical) |
6 | doornumber | Number of doors in a car (Categorical) |
7 | carbody | body of car (Categorical) |
8 | drivewheel | type of drive wheel (Categorical) |
9 | enginelocation | Location of car engine (Categorical) |
10 | wheelbase | Wheelbase of car (Numeric) |
11 | carlength | Length of car (Numeric) |
12 | carwidth | Width of car (Numeric) |
13 | carheight | height of car (Numeric) |
14 | curbweight | The weight of a car without occupants or baggage. (Numeric) |
15 | enginetype | Type of engine. (Categorical) |
16 | cylindernumber | Number of cylinders in the car (Categorical) |
17 | enginesize | Size of engine (Numeric) |
18 | fuelsystem | Fuel system of car (Categorical) |
19 | boreratio | Boreratio of car (Numeric) |
20 | stroke | Stroke or volume inside the engine (Numeric) |
21 | compressionratio | compression ratio of car (Numeric) |
22 | horsepower | Horsepower (Numeric) |
23 | peakrpm | car peak rpm (Numeric) |
24 | citympg | Mileage in city (Numeric) |
25 | highwaympg | Mileage on highway (Numeric) |
26 | price(Dependent variable) | Price of car (Numeric) |
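Competition Task
Participants need to run a cluster analysis on this car data and find the competitors of Volkswagen cars. The task must be completed in a notebook in Tianchi Lab and shared on the competition forum.
(Cluster analysis is one of the most common data analysis methods. It can be used to group users, and also to group products, as in competitor analysis. Participants may choose the number of clusters themselves based on the characteristics of the dataset, and should explain the basis for that choice.)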
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
Load the data
df=pd.read_csv("car_price.csv")
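Extract the columns whose values are strings. Note that the value passed to include is the letter O (the object dtype), not the digit 0.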
cols = df.select_dtypes(include='O').columns
df2 = df.copy()
Convert the string columns to numeric labels
for col in cols:
    # Fit a separate encoder per column so each column gets its own label mapping
    le = LabelEncoder()
    df2[col] = le.fit_transform(df[col])
df2
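LabelEncoder assigns each category an integer, so a nominal field such as carbody ends up with an arbitrary numeric order that a distance-based method like KMeans will treat as meaningful. One common alternative is one-hot encoding. The following is only a minimal sketch on a hypothetical toy frame (the rows and the names toy and encoded are illustrative, not taken from car_price.csv), and it is not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "wagon"],
    "horsepower": [111, 154, 102],
})

# One 0/1 column per category, so no artificial order is introduced.
encoded = pd.get_dummies(toy, columns=["fueltype", "carbody"], dtype=float)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix: each categorical field contributes as many columns as it has distinct values.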
Scale the data
scaler = preprocessing.MinMaxScaler()
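Note: assign to df2.loc[:], not to df2. fit_transform returns a NumPy array, so assigning it to df2 directly would replace the DataFrame and lose the column names.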
df2.loc[:] = scaler.fit_transform(df2)
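An alternative that avoids in-place assignment is to wrap the scaled array in a new DataFrame with the original column names and index. The sketch below uses a hypothetical two-column toy frame (toy and scaled are illustrative names) to show the idea:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "wheelbase": [88.6, 99.8, 120.9],
    "price": [5000, 15000, 45000],
})

scaler = MinMaxScaler()
# fit_transform returns a NumPy array; rebuild a DataFrame around it.
scaled = pd.DataFrame(
    scaler.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(scaled)
```

Each column is mapped to the range [0, 1] independently, so fields with large magnitudes such as price do not dominate the distance computation.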
# Features used for clustering: every field except the ID and CarName
columns = ['symboling', 'fueltype', 'aspiration',
           'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
           'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
           'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
           'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
           'price']
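# Cluster the cars into 10 groups
kmeans = KMeans(n_clusters=10)
kmeans.fit(df2[columns])
y_pred = kmeans.predict(df2[columns])
# Attach each car's cluster label to the original (unscaled) data
df['result'] = y_pred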
df
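The task asks for a justification of the number of clusters, while the value 10 above is chosen without one. One possible way to support the choice is to compare silhouette scores over a range of candidate values. The sketch below runs on synthetic data, assumed here purely for illustration (three well-separated blobs; X, scores and best_k are hypothetical names); on the real data, X would be the scaled feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score is only one heuristic. Plotting kmeans.inertia_ against k (the elbow method) is another common option, and the two do not always agree.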
List the unique car names to make them easier to browse
df['CarName'].unique()
name = 'volkswagen rabbit'
Take the rows that share the same cluster label as the competitors of our car
label = df[df['CarName']==name]['result'].values[0]
df[df['result']==label]
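The lookup above covers a single model, and its result also contains cars from the same manufacturer. Since the task asks about Volkswagen cars in general, one possible extension is to collect the clusters of every Volkswagen model and then drop the Volkswagen rows. The sketch below uses a hypothetical clustering result (toy, brand and competitors are illustrative names) and assumes the brand is the first word of CarName. If brand names are spelled inconsistently in the real data, they would need to be normalized first.

```python
import pandas as pd

# Hypothetical clustering result; rows and labels are illustrative only.
toy = pd.DataFrame({
    "CarName": ["volkswagen rabbit", "volkswagen dasher", "toyota corolla",
                "honda civic", "bmw x3", "audi 100ls"],
    "result": [0, 1, 0, 0, 2, 1],
})

# Assumption: the brand is the first whitespace-separated token of CarName.
toy["brand"] = toy["CarName"].str.split().str[0]

# Cluster labels that contain at least one Volkswagen model.
vw_labels = toy.loc[toy["brand"] == "volkswagen", "result"].unique()

# Competitors: cars in those clusters that are not Volkswagens themselves.
competitors = toy[toy["result"].isin(vw_labels) & (toy["brand"] != "volkswagen")]
print(competitors)
```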