# A Classic Data Analysis Case Study: CDNow (Recommended for Beginners)

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
```
1. Load the data. The file has no header row, so column names must be supplied, and the default separator must be changed, since the fields are separated by multiple spaces.
```python
columns = ['user_id', 'order_dt', 'order_products', 'order_amount']
# The file name below is an assumption — substitute the actual path to the CDNow dataset
df = pd.read_table('CDNOW_master.txt', names=columns, sep=r'\s+')
```
2. Inspect the data. Note that order_dt is the order date stored as int64 rather than a datetime type, so it will need converting later. A user can also buy more than once per day — user_id 2, for example, placed two orders on 19970112.
```python
df.head()
```
```
   user_id  order_dt  order_products  order_amount
0        1  19970101               1         11.77
1        2  19970112               1         12.00
2        2  19970112               5         77.00
3        3  19970102               2         20.76
4        3  19970330               2         20.76
```
```python
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   user_id         69659 non-null  int64
 1   order_dt        69659 non-null  int64
 2   order_products  69659 non-null  int64
 3   order_amount    69659 non-null  float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB
```
3. Process the data. There are no nulls — the data is clean — so the only step needed is converting the date column.
```python
df['order_date'] = pd.to_datetime(df.order_dt, format='%Y%m%d')
df['month'] = df.order_date.values.astype('datetime64[M]')
```

In these format codes, %H is the hour and %M the minute — note the case differs from %m, the month — and %S is the second.

(The dates could equally be truncated to day or year granularity.)
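As a quick sketch of those alternatives (using made-up dates rather than the CDNow data), casting to `datetime64[D]` or `datetime64[Y]` truncates dates the same way `datetime64[M]` truncates them to the month:

```python
import pandas as pd
import numpy as np

# Made-up sample dates, parsed the same way as order_dt above
s = pd.to_datetime(pd.Series(['19970112', '19970330']), format='%Y%m%d')

by_month = s.values.astype('datetime64[M]')  # truncate to month start
by_year = s.values.astype('datetime64[Y]')   # truncate to year start

print(by_month)  # ['1997-01' '1997-03']
print(by_year)   # ['1997' '1997']
```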

```python
df.head()
```
```
   user_id  order_dt  order_products  order_amount order_date      month
0        1  19970101               1         11.77 1997-01-01 1997-01-01
1        2  19970112               1         12.00 1997-01-12 1997-01-01
2        2  19970112               5         77.00 1997-01-12 1997-01-01
3        3  19970102               2         20.76 1997-01-02 1997-01-01
4        3  19970330               2         20.76 1997-03-30 1997-03-01
```
```python
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   user_id         69659 non-null  int64
 1   order_dt        69659 non-null  int64
 2   order_products  69659 non-null  int64
 3   order_amount    69659 non-null  float64
 4   order_date      69659 non-null  datetime64[ns]
 5   month           69659 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(3)
memory usage: 3.2 MB
```

pandas also has a dedicated time-series module, pd.tseries, whose offsets can shift dates — another good way to handle datetime data. Datetimes can also serve as the index, which is widely used in finance and accounting.

```python
# df.order_date - pd.tseries.offsets.MonthBegin(1)
```
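A minimal sketch of the offset approach (the dates are illustrative): subtracting `MonthBegin(1)` rolls a mid-month date back to the first of its month, matching the `datetime64[M]` cast used earlier. One caveat: a date already on the 1st rolls back a full month.

```python
import pandas as pd

d = pd.Timestamp('1997-01-12')
print(d - pd.tseries.offsets.MonthBegin(1))  # 1997-01-01 00:00:00

# Caveat: a month-start date rolls back to the previous month's start
print(pd.Timestamp('1997-01-01') - pd.tseries.offsets.MonthBegin(1))  # 1996-12-01 00:00:00
```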
4. Look at the distribution per order. From the summary below, users buy 2.4 items per order on average, with a standard deviation of 2.3 — some spread, but not much. The median is 2 and the 75th percentile is 3, so the vast majority of orders are small, although the maximum of 99 items is strikingly high. Order amounts tell the same story: most orders are small. Consumption data generally has a long-tail shape — most users spend a little, while a small fraction contributes the bulk of the revenue, the familiar 80/20 rule.
```python
df.describe()
```
```
            user_id      order_dt  order_products  order_amount
count  69659.000000  6.965900e+04    69659.000000  69659.000000
mean   11470.854592  1.997228e+07        2.410040     35.893648
std     6819.904848  3.837735e+03        2.333924     36.281942
min        1.000000  1.997010e+07        1.000000      0.000000
25%     5506.000000  1.997022e+07        1.000000     14.490000
50%    11410.000000  1.997042e+07        2.000000     25.980000
75%    17273.000000  1.997111e+07        3.000000     43.700000
max    23570.000000  1.998063e+07       99.000000   1286.010000
```
5. The data above is at per-order granularity; now aggregate it per user.
```python
user_grouped = df.groupby('user_id').sum()
user_grouped.head()
```
```
          order_dt  order_products  order_amount
user_id
1         19970101               1         11.77
2         39940224               6         89.00
3        119833602              16        156.46
4         79882233               7        100.50
5        219686137              29        385.61
```

```python
user_grouped.describe()
```
```
           order_dt  order_products  order_amount
count  2.357000e+04    23570.000000  23570.000000
mean   5.902627e+07        7.122656    106.080426
std    9.460684e+07       16.983531    240.925195
min    1.997010e+07        1.000000      0.000000
25%    1.997021e+07        1.000000     19.970000
50%    1.997032e+07        3.000000     43.395000
75%    5.992125e+07        7.000000    106.475000
max    4.334408e+09     1033.000000  13990.930000
```

There are 23,570 users in total; the average user bought about 7 items and spent roughly 106 dollars over the whole period. (Note that order_dt is still an integer here, so its sum is meaningless.)

6. Next, analyze by month.
```python
df.groupby('month').order_products.sum().plot()
```

```python
df.groupby('month').order_amount.sum().plot()
```

7. Why do both monthly curves peak early and then fall off? The first hypothesis is that something is unusual about the users — the early months might contain outliers. The second is that there were promotions or marketing pushes early on; with only transaction data available, that one cannot be tested, so we check the first.

8. Check for outliers.

8.1 Plot every order as a scatter point. Order amount and item count move together regularly — roughly ten dollars per item. Extreme orders are rare, with very few above 1,000, so outliers are clearly not the culprit behind the swings.

```python
df.plot.scatter(x='order_amount', y='order_products')
```

8.2 Plot the same scatter per user. Users look healthy too, with an even more regular pattern than the orders. Since this is a CD retailer with a fairly uniform product line, amount and quantity are close to linear and there are hardly any outliers.

```python
df.groupby('user_id').sum().plot.scatter(x='order_amount', y='order_products')
```

8.3 A few users have very strong purchasing power, but they are rare. Histograms give a better view.

```python
plt.figure(figsize=(12, 4))
plt.subplot(121)
df.order_amount.hist(bins=30)

plt.subplot(122)
df.groupby('user_id').order_amount.sum().hist(bins=30)
```

plt.subplot draws subplots, with the layout encoded as digits: 121 means a 1×2 grid, occupying the first cell (row 1, column 1), and 122 the second cell. plt.figure sets the figure size — widening it a little is enough to fit the two subplots side by side.
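The same layout can also be written with matplotlib's object-oriented API, which many find clearer. A standalone sketch with toy data (the `Agg` backend line is only there to make it runnable outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; in a notebook, %matplotlib inline suffices
import matplotlib.pyplot as plt

# One row, two columns — equivalent to plt.subplot(121) / plt.subplot(122)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist([1, 2, 2, 3], bins=3)   # toy data standing in for df.order_amount
ax2.hist([5, 5, 6, 9], bins=3)   # toy data standing in for the per-user sums
```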

8.4 Having looked at amounts and quantities, now look at when users buy. As the outputs below show, every user's first purchase falls within the first three months of 1997, while last purchases are spread across the whole period.

```python
df.groupby('user_id').month.min().value_counts()
```
```
1997-02-01    8476
1997-01-01    7846
1997-03-01    7248
Name: month, dtype: int64
```

```python
df.groupby('user_id').month.max().value_counts()
```
```
1997-02-01    4912
1997-03-01    4478
1997-01-01    4192
1998-06-01    1506
1998-05-01    1042
1998-03-01     993
1998-04-01     769
1997-04-01     677
1997-12-01     620
1997-11-01     609
1998-02-01     550
1998-01-01     514
1997-06-01     499
1997-07-01     493
1997-05-01     480
1997-10-01     455
1997-09-01     397
1997-08-01     384
Name: month, dtype: int64
```

9. Next, analyze the repurchase rate and the return rate. Here, the repurchase rate means the share of a month's buyers who bought more than once within that month; the return rate is defined in step 10. Start by pivoting the purchase counts.
```python
pivoted_counts = df.pivot_table(index='user_id', columns='month',
                                values='order_dt', aggfunc='count').fillna(0)
columns_month = df.month.sort_values().astype('str').unique()

pivoted_counts.columns = columns_month
pivoted_counts.head()
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 2.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
5 2.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
```

```python
pivoted_counts_trans = pivoted_counts.applymap(lambda x: 1 if x > 1 else np.nan if x == 0 else 0)
pivoted_counts_trans.head()
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN 0.0 NaN
4 1.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
5 1.0 0.0 NaN 0.0 0.0 0.0 0.0 NaN 0.0 NaN NaN 1.0 0.0 NaN NaN NaN NaN NaN
```

applymap applies a function to every element of a DataFrame. The lambda has to distinguish three outcomes, so two if/else conditional expressions are chained — remember, a lambda cannot contain elif.
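For illustration, the same three-way branch can be written as chained conditional expressions (a standalone toy, separate from the project's DataFrame):

```python
import math

# 1 if repeat-purchased that month, NaN if no purchase, else 0 — no elif needed
flag = lambda x: 1 if x > 1 else (float('nan') if x == 0 else 0)

print(flag(3))  # 1
print(flag(1))  # 0
print(flag(0))  # nan
```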

```python
(pivoted_counts_trans.sum() / pivoted_counts_trans.count()).plot(figsize=(10, 4))
```

Since NaN is ignored by both, sum() counts each month's repeat purchasers and count() counts all of that month's purchasers, so the ratio plotted is the monthly repurchase rate.

10. Next, compute the return rate: the share of users who purchased in one time window and purchase again in the following window. For example, if 1,000 users bought in January and 300 of them buy again in February, January's return rate is 30%.
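A toy sketch of that definition with made-up data (four users, two months; 1 means the user bought that month):

```python
import pandas as pd

toy = pd.DataFrame({'jan': [1, 1, 1, 0],
                    'feb': [1, 0, 1, 1]})

jan_buyers = toy[toy.jan == 1]           # 3 users bought in January
rate = (jan_buyers.feb == 1).mean()      # 2 of them bought again in February
print(rate)  # ~0.667
```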

```python
pivoted_amount = df.pivot_table(index='user_id', columns='month',
                                values='order_amount', aggfunc='mean').fillna(0)
columns_month = df.month.sort_values().astype('str').unique()

pivoted_amount.columns = columns_month
pivoted_amount.head()
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 11.77 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.000 0.000 0.00 0.0 0.0 0.0 0.00 0.0
2 44.50 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.000 0.000 0.00 0.0 0.0 0.0 0.00 0.0
3 20.76 0.0 20.76 19.54 0.00 0.00 0.00 0.00 0.00 0.0 39.205 0.000 0.00 0.0 0.0 0.0 16.99 0.0
4 29.53 0.0 0.00 0.00 0.00 0.00 0.00 14.96 0.00 0.0 0.000 26.480 0.00 0.0 0.0 0.0 0.00 0.0
5 21.65 38.9 0.00 45.55 38.71 26.14 28.14 0.00 40.47 0.0 0.000 43.465 37.47 0.0 0.0 0.0 0.00 0.0
```

```python
pivoted_purchase = pivoted_amount.applymap(lambda x: 1 if x > 0 else 0)
pivoted_purchase.head()
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0
4 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
5 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0
```

```python
def purchase_return(data):
    status = []
    for i in range(17):                # 18 months -> 17 adjacent month pairs
        if data.iloc[i] == 1:
            if data.iloc[i + 1] == 1:
                status.append(1)       # bought this month and the next
            else:
                status.append(0)       # bought this month but not the next
        else:
            status.append(np.nan)      # no purchase this month
    status.append(np.nan)              # the last month has no "next month"
    return status

pivoted_purchase_return = pivoted_purchase.apply(purchase_return, axis=1, result_type='expand')
pivoted_purchase_return.columns = columns_month
pivoted_purchase_return.head()
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.0 NaN 1.0 0.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN 0.0 NaN
4 0.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
5 1.0 0.0 NaN 1.0 1.0 1.0 0.0 NaN 0.0 NaN NaN 1.0 0.0 NaN NaN NaN NaN NaN
```

```python
(pivoted_purchase_return.sum() / pivoted_purchase_return.count()).plot(figsize=(10, 4))
```

Again, sum() counts the users who came back the following month and count() counts all of that month's buyers, so the curve is the monthly return rate.

11. Now segment users by their purchase behaviour into a few simple states: new, active (act), inactive (unact), and returning (return); before a user's first purchase they are marked unregistered (unreg).

```python
def state_return(data):
    status = []
    for i in range(18):
        if data.iloc[i] == 0:                 # no purchase this month
            if len(status) == 0:
                status.append('unreg')        # never purchased yet
            elif status[i - 1] == 'unreg':
                status.append('unreg')
            else:
                status.append('unact')        # lapsed
        else:                                 # purchased this month
            if len(status) == 0:
                status.append('new')
            elif status[i - 1] == 'unreg':
                status.append('new')          # first-ever purchase
            elif status[i - 1] == 'unact':
                status.append('return')       # came back after lapsing
            else:
                status.append('act')          # bought in consecutive months
    return status
```

```python
pivoted_purchase_status = pivoted_purchase.apply(state_return, axis=1, result_type='expand')
pivoted_purchase_status.columns = columns_month
pivoted_purchase_status.head()
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
user_id
1 new unact unact unact unact unact unact unact unact unact unact unact unact unact unact unact unact unact
2 new unact unact unact unact unact unact unact unact unact unact unact unact unact unact unact unact unact
3 new unact return act unact unact unact unact unact unact return unact unact unact unact unact return unact
4 new unact unact unact unact unact unact return unact unact unact return unact unact unact unact unact unact
5 new act unact return act act act unact return unact unact return act unact unact unact unact unact
```

```python
pivoted_status_counts = pivoted_purchase_status.replace('unreg', np.nan).apply(lambda x: pd.value_counts(x))
pivoted_status_counts
```
```
1997-01-01 1997-02-01 1997-03-01 1997-04-01 1997-05-01 1997-06-01 1997-07-01 1997-08-01 1997-09-01 1997-10-01 1997-11-01 1997-12-01 1998-01-01 1998-02-01 1998-03-01 1998-04-01 1998-05-01 1998-06-01
act NaN 1155.0 1680 1773.0 852.0 747.0 746.0 604.0 528.0 532.0 624 632.0 512.0 472.0 569.0 517.0 458.0 446.0
new 7814.0 8455.0 7231 NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN
return NaN NaN 595 1049.0 1362.0 1592.0 1434.0 1168.0 1211.0 1307.0 1402 1232.0 1025.0 1079.0 1489.0 919.0 1030.0 1060.0
unact NaN 6659.0 13994 20678.0 21286.0 21161.0 21320.0 21728.0 21761.0 21661.0 21474 21638.0 21965.0 21951.0 21444.0 22066.0 22014.0 21996.0
```

```python
pivoted_status_counts.fillna(0).T.plot.area(figsize=(12, 6))
```

```python
return_rate = pivoted_status_counts.apply(lambda x: x / x.sum(), axis=1)
return_rate.loc['return'].plot(figsize=(12, 6))
```

```python
return_rate.loc['act'].plot(figsize=(12, 6))
```

12. Now examine user quality. Since consumption shows a clear 80/20 tilt, we want to know how large a share of revenue the high-value users contribute.
```python
user_amount = df.groupby('user_id').order_amount.sum().sort_values().reset_index()
user_amount['amount_cumsum'] = user_amount.order_amount.cumsum()
user_amount.tail()
```
``````
```
       user_id  order_amount  amount_cumsum
23565     7931       6497.18     2463822.60
23566    19339       6552.70     2470375.30
23567     7983       6973.07     2477348.37
23568    14048       8976.33     2486324.70
23569     7592      13990.93     2500315.63
```

```python
total_amount = user_amount.amount_cumsum.max()
user_amount['prop'] = user_amount.amount_cumsum.apply(lambda x: x / total_amount)
user_amount.tail()
```
``````
```
       user_id  order_amount  amount_cumsum      prop
23565     7931       6497.18     2463822.60  0.985405
23566    19339       6552.70     2470375.30  0.988025
23567     7983       6973.07     2477348.37  0.990814
23568    14048       8976.33     2486324.70  0.994404
23569     7592      13990.93     2500315.63  1.000000
```

```python
user_amount.prop.plot()
```

```python
user_counts = df.groupby('user_id').order_dt.count().sort_values().reset_index()
user_counts['counts_cumsum'] = user_counts.order_dt.cumsum()

total_counts = user_counts.counts_cumsum.max()
user_counts['prop'] = user_counts.counts_cumsum.apply(lambda x: x / total_counts)
user_counts.prop.plot()
```

13. Next, compute user lifetime, defined here as the span from a user's first purchase to their last.
```python
user_purchase = df[['user_id', 'order_products', 'order_amount', 'order_date']]
order_date_min = user_purchase.groupby('user_id').order_date.min()
order_date_max = user_purchase.groupby('user_id').order_date.max()
(order_date_max - order_date_min).head(10)
```
```
user_id
1       0 days
2       0 days
3     511 days
4     345 days
5     367 days
6       0 days
7     445 days
8     452 days
9     523 days
10      0 days
Name: order_date, dtype: timedelta64[ns]
```

```python
(order_date_max - order_date_min).describe()
```
```
count                          23570
mean     134 days 20:55:36.987696224
std      180 days 13:46:43.039788104
min                  0 days 00:00:00
25%                  0 days 00:00:00
50%                  0 days 00:00:00
75%                294 days 00:00:00
max                544 days 00:00:00
Name: order_date, dtype: object
```

The mean lifetime is about 134 days, but the median is 0: more than half of the users bought only once.

```python
((order_date_max - order_date_min) / np.timedelta64(1, 'D')).hist(bins=15)
```

```python
life_time = (order_date_max - order_date_min).reset_index()  # convert to a DataFrame
life_time.head()
```
```
   user_id order_date
0        1     0 days
1        2     0 days
2        3   511 days
3        4   345 days
4        5   367 days
```
```python
life_time['life_time'] = life_time.order_date / np.timedelta64(1, 'D')
life_time[life_time.life_time > 0].life_time.hist(bins=100, figsize=(12, 6))
```

```python
life_time[life_time.life_time > 400].life_time.count() / life_time[life_time.life_time > 0].life_time.count()
```
```
0.31703716568252865
```

```python
life_time[life_time.life_time > 0].life_time.mean()
```
```
276.0448072247308
```

Among users who bought more than once, about 31.7% have a lifetime over 400 days, and the average lifetime is 276 days — much healthier than the overall figure.

14. Now for retention, another classic metric in consumption analysis: the proportion of users who purchase again after their first purchase. Unlike the return rate, retention is anchored on the first purchase and is measured across several time windows.
```python
user_purchase_retention = pd.merge(left=user_purchase, right=order_date_min.reset_index(),
                                   how='inner', on='user_id', suffixes=('', '_min'))
user_purchase_retention.head()
```
```
   user_id  order_products  order_amount order_date order_date_min
0        1               1         11.77 1997-01-01     1997-01-01
1        2               1         12.00 1997-01-12     1997-01-12
2        2               5         77.00 1997-01-12     1997-01-12
3        3               2         20.76 1997-01-02     1997-01-02
4        3               2         20.76 1997-03-30     1997-01-02
```

```python
user_purchase_retention['order_date_diff'] = user_purchase_retention.order_date - user_purchase_retention.order_date_min
```
```
user_id order_products order_amount order_date order_date_min order_date_diff
0 1 1 11.77 1997-01-01 1997-01-01 0 days
1 2 1 12.00 1997-01-12 1997-01-12 0 days
2 2 5 77.00 1997-01-12 1997-01-12 0 days
3 3 2 20.76 1997-01-02 1997-01-02 0 days
4 3 2 20.76 1997-03-30 1997-01-02 87 days
```

```python
date_trans = lambda x: x / np.timedelta64(1, 'D')

user_purchase_retention['date_diff'] = user_purchase_retention.order_date_diff.apply(date_trans)
```
```
user_id order_products order_amount order_date order_date_min order_date_diff date_diff
0 1 1 11.77 1997-01-01 1997-01-01 0 days 0.0
1 2 1 12.00 1997-01-12 1997-01-12 0 days 0.0
2 2 5 77.00 1997-01-12 1997-01-12 0 days 0.0
3 3 2 20.76 1997-01-02 1997-01-02 0 days 0.0
4 3 2 20.76 1997-03-30 1997-01-02 87 days 87.0
5 3 2 19.54 1997-04-02 1997-01-02 90 days 90.0
6 3 5 57.45 1997-11-15 1997-01-02 317 days 317.0
7 3 4 20.96 1997-11-25 1997-01-02 327 days 327.0
8 3 1 16.99 1998-05-28 1997-01-02 511 days 511.0
9 4 2 29.33 1997-01-01 1997-01-01 0 days 0.0
```
```python
bins = [0, 3, 7, 15, 30, 60, 90, 180, 365]
user_purchase_retention['date_diff_bin'] = pd.cut(user_purchase_retention.date_diff, bins=bins)
```

Note that pd.cut uses right-closed intervals by default, so a date_diff of 0 falls outside (0, 3] and becomes NaN — which conveniently drops each user's first purchase from the retention windows.
```
user_id order_products order_amount order_date order_date_min order_date_diff date_diff date_diff_bin
0 1 1 11.77 1997-01-01 1997-01-01 0 days 0.0 NaN
1 2 1 12.00 1997-01-12 1997-01-12 0 days 0.0 NaN
2 2 5 77.00 1997-01-12 1997-01-12 0 days 0.0 NaN
3 3 2 20.76 1997-01-02 1997-01-02 0 days 0.0 NaN
4 3 2 20.76 1997-03-30 1997-01-02 87 days 87.0 (60.0, 90.0]
5 3 2 19.54 1997-04-02 1997-01-02 90 days 90.0 (60.0, 90.0]
6 3 5 57.45 1997-11-15 1997-01-02 317 days 317.0 (180.0, 365.0]
7 3 4 20.96 1997-11-25 1997-01-02 327 days 327.0 (180.0, 365.0]
8 3 1 16.99 1998-05-28 1997-01-02 511 days 511.0 NaN
9 4 2 29.33 1997-01-01 1997-01-01 0 days 0.0 NaN
10 4 2 29.73 1997-01-18 1997-01-01 17 days 17.0 (15.0, 30.0]
11 4 1 14.96 1997-08-02 1997-01-01 213 days 213.0 (180.0, 365.0]
12 4 2 26.48 1997-12-12 1997-01-01 345 days 345.0 (180.0, 365.0]
13 5 2 29.33 1997-01-01 1997-01-01 0 days 0.0 NaN
14 5 1 13.97 1997-01-14 1997-01-01 13 days 13.0 (7.0, 15.0]
```
```python
pivoted_retention = user_purchase_retention.pivot_table(index='user_id', columns='date_diff_bin',
                                                        values='order_amount', aggfunc='sum')
pivoted_retention.head(10)
```
```
date_diff_bin  (0, 3]  (3, 7]  (7, 15]  (15, 30]  (30, 60]  (60, 90]  (90, 180]  (180, 365]
user_id
1                 0.0     0.0     0.00      0.00      0.00       0.0       0.00        0.00
2                 0.0     0.0     0.00      0.00      0.00       0.0       0.00        0.00
3                 0.0     0.0     0.00      0.00      0.00      40.3       0.00       78.41
4                 0.0     0.0     0.00     29.73      0.00       0.0       0.00       41.44
5                 0.0     0.0    13.97      0.00     38.90       0.0     110.40      155.54
6                 0.0     0.0     0.00      0.00      0.00       0.0       0.00        0.00
7                 0.0     0.0     0.00      0.00      0.00       0.0       0.00       97.43
8                 0.0     0.0     0.00      0.00     13.97       0.0      45.29      104.17
9                 0.0     0.0     0.00      0.00      0.00       0.0      30.33        0.00
10                0.0     0.0     0.00      0.00      0.00       0.0       0.00        0.00
```

```python
pivoted_retention.replace(0, np.nan).mean()
```
```
date_diff_bin
(0, 3]        35.905798
(3, 7]        36.385121
(7, 15]       42.669895
(15, 30]      45.986198
(30, 60]      50.215070
(60, 90]      48.975277
(90, 180]     67.223297
(180, 365]    91.960059
dtype: float64
```

```python
pivoted_retention_trans = pivoted_retention.applymap(lambda x: 1 if x > 0 else 0)
pivoted_retention_trans.head()
```
```
date_diff_bin  (0, 3]  (3, 7]  (7, 15]  (15, 30]  (30, 60]  (60, 90]  (90, 180]  (180, 365]
user_id
1                   0       0        0         0         0         0          0           0
2                   0       0        0         0         0         0          0           0
3                   0       0        0         0         0         1          0           1
4                   0       0        0         1         0         0          0           1
5                   0       0        1         0         1         0          1           1
```

```python
(pivoted_retention_trans.sum() / pivoted_retention_trans.count()).plot.bar()
```

15. What does "playing the long game" look like in numbers? Compute each user's average interval between purchases.
```python
def diff(group):
    # interval (in days) from each purchase to the user's next purchase
    d = group.date_diff.shift(-1) - group.date_diff
    return d

last_diff = user_purchase_retention.groupby('user_id').apply(diff)
last_diff.head(10)
```
```
user_id
1        0       NaN
2        1       0.0
         2       NaN
3        3      87.0
         4       3.0
         5     227.0
         6      10.0
         7     184.0
         8       NaN
4        9      17.0
Name: date_diff, dtype: float64
```

```python
last_diff.mean()
```
```
68.97376814424265
```

On average, a user's next purchase comes about 69 days after the previous one.

```python
last_diff.hist(bins=20)
```


CSDN@报告，今天也有好好学习
