Alibaba Tianchi Machine Learning (Data Analysis Expert Competition 3: Cluster Analysis of Automobile Products)

Competition background

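The competition is set against the background of competitor analysis: by clustering the data, cars are assigned to groups. For a given car model, cluster analysis can identify its competing models. The task encourages learners to use the vehicle data to build model profiles and to support data-driven decisions on product positioning and competitor analysis.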

Competition data

Data source: car_price.csv, which contains 26 fields for 205 car models.

1 Car_ID Unique id of each observation (Integer)
2 Symboling Assigned insurance risk rating; a value of +3 indicates that the car is risky, -3 that it is probably quite safe. (Categorical)
3 carCompany Name of car company (Categorical)
4 fueltype Car fuel type, i.e. gas or diesel (Categorical)
5 aspiration Aspiration used in a car (Categorical)
6 doornumber Number of doors in a car (Categorical)
7 carbody body of car (Categorical)
8 drivewheel type of drive wheel (Categorical)
9 enginelocation Location of car engine (Categorical)
10 wheelbase Wheelbase of car (Numeric)
11 carlength Length of car (Numeric)
12 carwidth Width of car (Numeric)
13 carheight height of car (Numeric)
14 curbweight The weight of a car without occupants or baggage. (Numeric)
15 enginetype Type of engine. (Categorical)
16 cylindernumber Number of cylinders in the car (Categorical)
17 enginesize Size of engine (Numeric)
18 fuelsystem Fuel system of car (Categorical)
19 boreratio Boreratio of car (Numeric)
20 stroke Stroke or volume inside the engine (Numeric)
21 compressionratio compression ratio of car (Numeric)
22 horsepower Horsepower (Numeric)
23 peakrpm car peak rpm (Numeric)
24 citympg Mileage in city (Numeric)
25 highwaympg Mileage on highway (Numeric)
26 price(Dependent variable) Price of car (Numeric)

Competition task

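Participants need to run a cluster analysis on this car dataset and find the corresponding competitors of Volkswagen cars. The task must be completed in a notebook in the Tianchi Lab and shared on the competition forum.
(Cluster analysis is one of the most common data analysis methods. It can be used to group users and also to group products, for example in competitor analysis. Participants may choose the number of clusters themselves based on the characteristics of the dataset, and should explain the basis for that choice.)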
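
The task leaves the number of clusters to the participant but asks for a justification. One common way to support the choice is to compare silhouette scores across candidate values of k. The following is a minimal sketch only: it uses synthetic data from `make_blobs` as a stand-in for the scaled car features, and the candidate range, `n_init` and `random_state` values are assumptions rather than anything prescribed by the competition.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled car features
# (assumption: three well-separated groups).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    # random_state fixes the initialisation so the run is repeatable.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    # Silhouette is in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On the real data the same loop could be run over the scaled feature columns; the score is a guide rather than a rule, so the chosen k should still make sense for the car segments it produces.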

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing

Load the data

df=pd.read_csv("car_price.csv")

Extract the columns whose values are strings. Note that the argument to include is the letter O (the object dtype), not the digit 0.

cols = df.select_dtypes(include='O').columns
df2 = df.copy()
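
To see what `select_dtypes` returns, here is a small sketch on a toy frame. The column values are hypothetical and the string columns are given the object dtype explicitly, so the example does not depend on how a particular pandas version infers string columns.

```python
import pandas as pd

# Toy frame with hypothetical values; string columns are stored as object dtype.
toy = pd.DataFrame({
    "fueltype": pd.Series(["gas", "diesel", "gas"], dtype=object),
    "carbody": pd.Series(["sedan", "wagon", "hatchback"], dtype=object),
    "horsepower": [111, 154, 102],
    "price": [13495.0, 16500.0, 13950.0],
})

# "O" is shorthand for the object dtype.
text_cols = toy.select_dtypes(include="O").columns.tolist()
# "number" selects every integer and float column.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(text_cols)
print(numeric_cols)
```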

Convert the string columns to numeric labels

for col in cols:
    le = LabelEncoder()
    df2[col] = le.fit_transform(df[col])
df2
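
As a quick illustration of what `LabelEncoder` does to one column, the sketch below encodes a short list of hypothetical fuel types and shows how to recover the mapping. Keep in mind that the integer codes follow the sorted order of the labels and carry no real meaning, yet a distance-based method such as k-means will treat them as numbers.

```python
from sklearn.preprocessing import LabelEncoder

fuel = ["gas", "diesel", "gas", "gas"]  # hypothetical sample values

le = LabelEncoder()
codes = le.fit_transform(fuel)

# classes_ holds the original labels in sorted order;
# a label's code is its index in that array.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
decoded = le.inverse_transform(codes).tolist()

print(mapping)
print(codes.tolist())
print(decoded)
```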

Scale the data

scaler = preprocessing.MinMaxScaler()

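Note: assign to df2.loc[:] rather than to df2. fit_transform returns a NumPy array, so assigning it to df2 directly would discard the DataFrame and its column names.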
df2.loc[:] = scaler.fit_transform(df2)
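
For reference, `MinMaxScaler` with its default settings maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks this against a manual computation; the input values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: two features on very different scales.
X = np.array([[1.0, 100.0],
              [3.0, 200.0],
              [5.0, 400.0]])

scaled = MinMaxScaler().fit_transform(X)

# Same computation by hand, column by column: (x - min) / (max - min).
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```

Scaling matters here because k-means relies on Euclidean distance, so an unscaled column with large values such as price would otherwise dominate the clustering.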
 

columns = ['symboling',  'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price']

kmeans= KMeans(n_clusters=10)

kmeans.fit(df2[columns])
y_pred = kmeans.predict(df2[columns])

df['result'] = y_pred
df
 

List the unique car names so they are easier to inspect

df['CarName'].unique()
name = 'volkswagen rabbit'

Extract the rows that share the same cluster label; these are the competitors of our car

label = df[df['CarName']==name]['result'].values[0]

df[df['result']==label]
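
The lookup above also returns the target car itself and any other cars of the same brand. A possible refinement, sketched below, wraps the lookup in a helper that drops the target's own brand. The helper name `find_competitors`, the toy data, the cluster labels, and the assumption that the brand is the first word of CarName are all illustrative and not part of the original notebook.

```python
import pandas as pd

def find_competitors(frame, name, name_col="CarName", label_col="result"):
    """Return rows in the same cluster as `name`, excluding the target's brand.

    Assumes the brand is the first word of the car name (illustrative convention).
    """
    matches = frame.loc[frame[name_col] == name, label_col]
    if matches.empty:
        raise KeyError(f"{name!r} not found in column {name_col!r}")
    label = matches.iloc[0]
    brand = name.split()[0]
    same_cluster = frame[frame[label_col] == label]
    # Keep only cars whose first word differs from the target's brand.
    is_other_brand = same_cluster[name_col].str.split().str[0] != brand
    return same_cluster[is_other_brand]

# Hypothetical cluster labels, for illustration only.
demo = pd.DataFrame({
    "CarName": ["volkswagen rabbit", "volkswagen dasher", "toyota corolla",
                "honda civic", "bmw x3"],
    "result": [2, 2, 2, 2, 5],
})
rivals = find_competitors(demo, "volkswagen rabbit")
print(rivals["CarName"].tolist())
```

With the notebook's data the call would be `find_competitors(df, name)`. Brand spellings in the CarName column may vary, so the first-word rule should be checked against the unique names before relying on it.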
