Kaggle | Medical Cost Personal Visualization

Data description

Loading the data

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
print(os.listdir("../input"))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
['insurance']
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
df = pd.read_csv("../input/insurance/insurance.csv")  ## load the data
df.head() ## preview the first 5 rows
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
print(df.shape)  ## dimensions of the data
print(df.describe()) ## descriptive statistics
print("Total number of NULL value in the dataset:", df.isnull().sum()) ## number of missing values per column
(1338, 7)
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
Total number of NULL value in the dataset: age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
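
The label in the last print says "total", but `df.isnull().sum()` returns one count per column. The sketch below shows one way to collapse that into a single number. The `demo` frame is a made-up stand-in with two missing cells, not the insurance file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with two missing cells on purpose.
demo = pd.DataFrame({
    "age": [19, 18, np.nan],
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", "no"],
})

per_column = demo.isnull().sum()        # Series: missing cells per column
total_missing = int(per_column.sum())   # one number for the whole frame

print(per_column)
print("Total number of NULL values:", total_missing)
```

On the real data the same two lines would be applied to `df`.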

Creating new features

BMI:

  • Normal: bmi <= 24
  • OverWeight: 24 < bmi <30
  • Obese: bmi >= 30
## Split by bmi
df['risk_type'] = np.where(df.bmi<=24, "Normal",
(np.where(df.bmi<30, "OverWeight", "Obese")))
## Split age into groups
df['Age_group'] = np.where(df.age<25, "Age below 25 year",
(np.where(df.age<35, "Age 25 to 34 year",
(np.where(df.age<55, "Age 35 to 54 year",
(np.where(df.age<75, "Age 55 to 74 year",
"Age more than 75 year")))))))
df.head()
   age     sex     bmi  children smoker     region      charges   risk_type          Age_group
0   19  female  27.900         0    yes  southwest  16884.92400  OverWeight  Age below 25 year
1   18    male  33.770         1     no  southeast   1725.55230       Obese  Age below 25 year
2   28    male  33.000         3     no  southeast   4449.46200       Obese  Age 25 to 34 year
3   33    male  22.705         0     no  northwest  21984.47061      Normal  Age 25 to 34 year
4   32    male  28.880         0     no  northwest   3866.85520  OverWeight  Age 25 to 34 year

The same grouping can also be done with the pandas cut() function.

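## The grouping can also be done as follows (cut is a pandas function, not a DataFrame method)
## Default bins are right-closed: (0, 24], (24, 29.99], (29.99, 100]
bins1 = [0, 24, 29.99, 100]
df['risk_bins'] = pd.cut(df['bmi'], bins1, labels=['Normal', 'OverWeight', 'Obese'])

## right=False gives [0, 25), [25, 35), ... which matches the np.where thresholds on age
bins2 = [0, 25, 35, 55, 75, 100]
df['age_bins'] = pd.cut(df['age'], bins2, right=False,
labels=["Age below 25 year", "Age 25 to 34 year", "Age 35 to 54 year",
"Age 55 to 74 year", "Age more than 75 year"])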
df["Charge"] = df["charges"]/1000
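
As a sanity check on the two binning approaches above, the sketch below compares nested `np.where` with `pd.cut` on a few hand-picked ages, including the boundary values. The ages are made up for illustration. It assumes the intent is for both approaches to produce the same labels, which requires `right=False` so that each bin is closed on the left, like the `<` comparisons.

```python
import numpy as np
import pandas as pd

# Hand-picked ages that sit on and around the bin edges (illustrative only).
ages = pd.Series([18, 24, 25, 34, 35, 54, 55, 64])
labels = ["Age below 25 year", "Age 25 to 34 year", "Age 35 to 54 year",
          "Age 55 to 74 year", "Age more than 75 year"]
edges = [0, 25, 35, 55, 75, 100]

# Nested np.where, same thresholds as in the post.
by_where = np.where(ages < 25, labels[0],
           np.where(ages < 35, labels[1],
           np.where(ages < 55, labels[2],
           np.where(ages < 75, labels[3], labels[4]))))

# Left-closed bins: [0, 25), [25, 35), ...
by_cut = pd.cut(ages, bins=edges, labels=labels, right=False)

# Default right-closed bins: (0, 25], (25, 35], ...
by_cut_default = pd.cut(ages, bins=edges, labels=labels)

print((by_cut.astype(str) == by_where).all())
print(by_cut_default[ages == 25].astype(str).tolist())
```

With the default right-closed bins, age 25 falls into the "below 25" group, so the two approaches disagree exactly on the edge values.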

Data visualization

## Pairwise relationship plot
sns.pairplot(df, height=1.8)
<seaborn.axisgrid.PairGrid at 0x7f6074b5ced0>
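  • The diagonal shows a histogram of each feature
  • The off-diagonal panels plot each pair of features against each other (one feature per axis)
  • Charge is charges divided by 1000, so the two are perfectly positively correlated and their panel is a straight 45-degree line
  • bmi and age show little correlation
  • Within each band of charges, charges (and therefore Charge) is positively correlated with age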
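
The visual reading of the pair plot can be backed with a correlation matrix. The sketch below uses synthetic data, not the insurance file: the column names mirror the post, but the generating formula for `charges` is an assumption chosen only so that it depends on `age`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: bmi is drawn independently of age, charges depends on age.
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
charges = 250 * age + rng.normal(0, 2000, size=n)  # hypothetical relationship

demo = pd.DataFrame({"age": age, "bmi": bmi, "charges": charges})
demo["Charge"] = demo["charges"] / 1000  # same rescaling as in the post

corr = demo.corr()  # Pearson correlation between every pair of columns
print(corr.round(3))
```

Because `Charge` is a positive multiple of `charges`, their correlation is 1 regardless of the data. On the real frame, `df[["age", "bmi", "children", "charges"]].corr()` would give the corresponding numbers.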
plt.rcParams["figure.figsize"] = (20, 14)
plt.subplot(421) ## split the canvas into 4 rows and 2 columns; 1st subplot
df['age'].value_counts().sort_index().plot.line(color="k") ## line plot
plt.title("Age distribution in the Data") ## title

plt.subplot(422) ## 2nd subplot
df['bmi'].sort_index().plot.hist(color="g") ## histogram
plt.title("bmi distribution in the Data")

plt.subplot(423) ## 3rd subplot
df['children'].value_counts().plot.line(color="b") ## line plot
plt.title("child distribution in the Data")

plt.subplot(424)
df['charges'].plot.hist(color="c")
plt.title("charges distribution in the Data")

plt.subplot(425)
df["risk_type"].value_counts().plot.bar() ## bar chart
plt.title("risk_type distribution in the Data")

plt.subplot(426)
df['smoker'].value_counts().plot.bar()
plt.title("smoker distribution in the data")

plt.subplot(427)
df['Age_group'].value_counts().plot.bar()
plt.title("Age_group distribution in the Data")
Text(0.5, 1.0, 'Age_group distribution in the Data')
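  • Ages in the sample are concentrated mainly in the "under 20" range
  • Obese samples make up the largest risk_type group
  • Smokers are the minority in the sample; non-smokers are the majority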
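
The bar-chart observations can also be read off as numbers with `value_counts`. The sketch below uses a tiny made-up frame (the rows are hypothetical, not taken from the dataset) to show counts and shares.

```python
import pandas as pd

# Hypothetical rows, only to demonstrate the calls.
demo = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "no", "yes", "no", "no"],
    "risk_type": ["Obese", "Obese", "Normal", "OverWeight",
                  "Obese", "Obese", "OverWeight", "Obese"],
})

smoker_share = demo["smoker"].value_counts(normalize=True)  # fractions summing to 1
risk_counts = demo["risk_type"].value_counts()              # raw counts, largest first

print(smoker_share)
print(risk_counts)
```

On the real data, the same calls on `df["smoker"]` and `df["risk_type"]` give the figures behind the bars.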
plt.rcParams['figure.figsize'] = (18, 8)
plt.subplot(221) ## split the canvas into 2 rows and 2 columns
sns.lineplot(x='age', y='bmi', data=df, color="b")

plt.subplot(222)
sns.lineplot(x='age', y='Charge', data=df, color="g") ## line plot

plt.subplot(223)
sns.scatterplot(x="Charge", y="bmi", data=df, color="k") ## scatter plot
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069f93d90>
  • bmi tends to rise with age
  • charges tends to rise with age

Next, bmi and charges are plotted against age as line charts, one line per age group:

plt.rcParams['figure.figsize'] = (18, 8)
plt.subplot(221)
## x-axis is age, y-axis is bmi, one line per Age_group
sns.lineplot(x='age', y='bmi', data=df, hue="Age_group")

plt.subplot(222)
sns.lineplot(x='age', y='Charge', data=df, hue='Age_group')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069d0fad0>
plt.rcParams["figure.figsize"]=(18,8)
plt.subplot(221)
sns.lineplot(x="age", y="bmi", data=df, color="b", hue="smoker")

plt.subplot(222)
sns.lineplot(x="age", y="Charge", data=df, color="g", hue="smoker")
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069c22150>
  • Charge is lower overall for non-smokers than for smokers
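
A grouped summary is one way to check this reading of the plot. The sketch below runs on a few made-up rows (hypothetical values, in the same thousands unit as `Charge`), so the numbers themselves say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "Charge": [16.9, 1.7, 4.4, 22.0, 3.9],
})

# Count, mean and median of Charge within each smoker group.
summary = demo.groupby("smoker")["Charge"].agg(["count", "mean", "median"])
print(summary)
```

Applied to `df`, the same `groupby` gives the per-group averages that the two lines in the plot are tracing.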
## region
southwest = df[df["region"]=="southwest"]
southeast = df[df["region"]=="southeast"]
northwest = df[df["region"]=="northwest"]
northeast = df[df["region"]=="northeast"]

plt.rcParams["figure.figsize"]=(18,8)
plt.subplot(421)
sns.lineplot(x="age", y="bmi", data=southwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southwest region")

plt.subplot(422)
sns.lineplot(x="bmi", y="Charge", data=southwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southwest region")

plt.subplot(423)
sns.lineplot(x="age", y="bmi", data=southeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southeast region")

plt.subplot(424)
sns.lineplot(x="bmi", y="Charge", data=southeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southeast region")

plt.subplot(425)
sns.lineplot(x="age", y="bmi", data=northwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northwest region")

plt.subplot(426)
sns.lineplot(x="bmi", y="Charge", data=northwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northwest region")

plt.subplot(427)
sns.lineplot(x="age", y="bmi", data=northeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northeast region")

plt.subplot(428)
sns.lineplot(x="bmi", y="Charge", data=northeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northeast region")
Text(0.5, 1.0, 'northeast region')
plt.rcParams["figure.figsize"]=(16,6)
plt.subplot(2,2,1)
sns.violinplot(x="sex", y="bmi", data=df) ## violin plot

plt.subplot(2,2,2)
sns.violinplot(x="smoker", y="bmi", data=df)

plt.subplot(2,2,3)
sns.violinplot(x="risk_type", y="bmi", data=df)

plt.subplot(2,2,4)
sns.violinplot(x="Age_group", y="bmi", data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f60692f6590>
plt.rcParams["figure.figsize"]=(16,6)
plt.subplot(2,2,1)
sns.violinplot(x="sex", y="Charge", data=df)

plt.subplot(2,2,2)
sns.violinplot(x="smoker", y="Charge", data=df)

plt.subplot(2,2,3)
sns.violinplot(x="risk_type", y="Charge", data=df)

plt.subplot(2,2,4)
sns.violinplot(x="Age_group", y="Charge", data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069128bd0>
plt.rcParams["figure.figsize"]=(28,10)
plt.subplot(2,2,1)
sns.violinplot(x="children", y="bmi", data=df, hue="smoker")

plt.rcParams["figure.figsize"]=(28,10)
plt.subplot(2,2,3)
sns.violinplot(x="children", y="Charge", data=df, hue="smoker")
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069054ad0>