Kaggle | Medical Cost Personal Visualization

Posted on 2020-05-24, updated on 2021-03-14

About the data

Data source: Kaggle
Visualization code source: EDA l Data Visualization
Implemented in a Kaggle online Jupyter Notebook

Loading the data

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
print(os.listdir("../input"))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

['insurance']

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

df = pd.read_csv("../input/insurance/insurance.csv")  ## load the data
df.head()  ## preview the first 5 rows

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

print(df.shape)  ## dimensions of the data (rows, columns)
print(df.describe())  ## descriptive statistics of the numeric columns
print("Total number of NULL value in the dataset:", df.isnull().sum())  ## number of missing values per column

(1338, 7)
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
Total number of NULL value in the dataset: age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
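
The printed label says "Total number", but `df.isnull().sum()` returns one count per column. A minimal sketch of reducing that to a single total, using a small made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells, for illustration only
toy = pd.DataFrame({
    "age": [19, 18, np.nan],
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", "no"],
})

per_column = toy.isnull().sum()  # Series: missing count per column
total = int(per_column.sum())    # one integer for the whole frame

print(per_column)
print("Total missing cells:", total)
```

On the insurance data every per-column count above is 0, so the total would be 0 as well.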

Creating new features

BMI:

Normal: bmi <= 24
OverWeight: 24 < bmi <30
Obese: bmi >= 30

## classify by bmi
df['risk_type'] = np.where(df.bmi<=24, "Normal", 
                          (np.where(df.bmi<30, "OverWeight", "Obese")))
## split age into groups
df['Age_group'] = np.where(df.age<25, "Age below 25 year", 
                          (np.where(df.age<35, "Age 25 to 34 year",
                                   (np.where(df.age<55, "Age 35 to 54 year",
                                            (np.where(df.age<75, "Age 55 to 74 year",
                                                     "Age more than 75 year")))))))
df.head()

	age	sex	bmi	children	smoker	region	charges	risk_type	Age_group
0	19	female	27.900	0	yes	southwest	16884.92400	OverWeight	Age below 25 year
1	18	male	33.770	1	no	southeast	1725.55230	Obese	Age below 25 year
2	28	male	33.000	3	no	southeast	4449.46200	Obese	Age 25 to 34 year
3	33	male	22.705	0	no	northwest	21984.47061	Normal	Age 25 to 34 year
4	32	male	28.880	0	no	northwest	3866.85520	OverWeight	Age 25 to 34 year
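
Nested `np.where` calls get hard to read as the number of classes grows. As a possible alternative (not the method used in the original notebook), `np.select` takes the conditions as a flat list and applies the first one that matches. The sketch below uses only the five `bmi` values from the preview rows, in a hypothetical frame named `toy`:

```python
import numpy as np
import pandas as pd

# bmi values taken from the five preview rows; `toy` is an illustrative name
toy = pd.DataFrame({"bmi": [27.900, 33.770, 33.000, 22.705, 28.880]})

# Conditions are checked in order; the first True one decides the label
conditions = [toy["bmi"] <= 24, toy["bmi"] < 30]
choices = ["Normal", "OverWeight"]
toy["risk_type"] = np.select(conditions, choices, default="Obese")

print(toy)
```

The resulting labels agree with the `risk_type` column in the preview for these five rows.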

The grouping can also be done with the cut() function in pandas.

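## the grouping can also be done as follows (cut is a pandas function, not a DataFrame method)
bins1 = [0, 24, 29.99, 100]
df['risk_bins'] = pd.cut(df['bmi'], bins1, labels=['Normal', 'OverWeight', 'Obese'])

## right=False gives left-closed bins such as [25, 35), matching the np.where thresholds
bins2 = [0, 25, 35, 55, 75, 100]
df['age_bins'] = pd.cut(df['age'], bins2, right=False,
                        labels=["Age below 25 year", "Age 25 to 34 year", "Age 35 to 54 year", 
                               "Age 55 to 74 year", "Age more than 75 year"])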

df["Charge"] = df["charges"]/1000
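
Which group an age falls into when it sits exactly on a bin edge depends on the `right` argument of `pd.cut`: with the default `right=True`, an age of 25 lands in the bin `(0, 25]`. The sketch below, on hypothetical edge-case ages, checks that `right=False` reproduces the nested `np.where` grouping of `Age_group`:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([18, 24, 25, 34, 35, 54, 55, 64])

labels = ["Age below 25 year", "Age 25 to 34 year", "Age 35 to 54 year",
          "Age 55 to 74 year", "Age more than 75 year"]

# right=False makes each bin closed on the left: [0, 25), [25, 35), ...
by_cut = pd.cut(ages, bins=[0, 25, 35, 55, 75, 100], labels=labels, right=False)

# The same grouping written as nested np.where calls
by_where = np.where(ages < 25, labels[0],
           np.where(ages < 35, labels[1],
           np.where(ages < 55, labels[2],
           np.where(ages < 75, labels[3], labels[4]))))

# True when both approaches assign every age to the same group
print((by_cut.astype(str) == by_where).all())
```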

Data visualization

## pairwise relationship plots
sns.pairplot(df, height=1.8)

<seaborn.axisgrid.PairGrid at 0x7f6074b5ced0>

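The diagonal panels show a histogram of each feature
The off-diagonal panels plot each pair of features against each other (one on each axis)
Charge is charges divided by 1000, so the two are perfectly positively correlated and their panel is a straight diagonal line
bmi and age show little correlation
Within each band of charges, charges (and therefore Charge) is positively correlated with age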
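
To put numbers on what the pair plot shows, one option is a Pearson correlation matrix. This is only a sketch: it uses the five preview rows rather than the full dataset (the frame name `toy` is illustrative), so the only value that carries over is the exact correlation of 1 between `charges` and `Charge`:

```python
import pandas as pd

# The five preview rows shown earlier, not the full dataset
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})
toy["Charge"] = toy["charges"] / 1000  # same rescaling as in the post

corr = toy.corr()  # pairwise Pearson correlation of the numeric columns
print(corr.round(3))
```

In the notebook itself, `df.corr(numeric_only=True)` would be the analogous call on the full data, assuming a pandas version that supports the `numeric_only` argument.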

plt.rcParams["figure.figsize"] = (20, 14)
plt.subplot(421)  ## split the canvas into a 4x2 grid; this is subplot 1
df['age'].value_counts().sort_index().plot.line(color="k")  ## line plot
plt.title("Age distribution in the Data")  ## title

plt.subplot(422)  ## subplot 2
df['bmi'].sort_index().plot.hist(color="g")  ## histogram
plt.title("bmi distribution in the Data")

plt.subplot(423)  ## subplot 3
df['children'].value_counts().sort_index().plot.line(color="b")  ## line plot, ordered by number of children
plt.title("child distribution in the Data")

plt.subplot(424)
df['charges'].plot.hist(color="c")
plt.title("charges distribution in the Data")

plt.subplot(425)
df["risk_type"].value_counts().plot.bar()  ## bar chart
plt.title("risk_type distribution in the Data")

plt.subplot(426)
df['smoker'].value_counts().plot.bar()
plt.title("smoker distribution in the data")

plt.subplot(427)
df['Age_group'].value_counts().plot.bar()
plt.title("Age_group distribution in the Data")

Text(0.5, 1.0, 'Age_group distribution in the Data')

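Ages in the sample are most heavily concentrated in the "under 20" range
Obese records are the most common in the sample
Smokers are the minority in the sample; non-smokers are the majority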

plt.rcParams['figure.figsize'] = (18, 8)
plt.subplot(221)  ## split the canvas into a 2x2 grid
sns.lineplot(x='age', y='bmi', data=df, color="b")

plt.subplot(222)
sns.lineplot(x='age', y='Charge', data=df, color="g")  ## line plot

plt.subplot(223)
sns.scatterplot(x="Charge", y="bmi", data=df, color="k")  ## scatter plot

<matplotlib.axes._subplots.AxesSubplot at 0x7f6069f93d90>

bmi tends to rise with age
charges tends to rise with age

Next, bmi and Charge are plotted against age as line plots, split by age group:

plt.rcParams['figure.figsize'] = (18, 8)
plt.subplot(221)
## x-axis is age, y-axis is bmi, one line per Age_group
sns.lineplot(x='age', y='bmi', data=df, hue="Age_group")

plt.subplot(222)
sns.lineplot(x='age', y='Charge', data=df, hue='Age_group')

<matplotlib.axes._subplots.AxesSubplot at 0x7f6069d0fad0>

plt.rcParams["figure.figsize"]=(18,8)
plt.subplot(221)
sns.lineplot(x="age", y="bmi", data=df, color="b", hue="smoker")

plt.subplot(222)
sns.lineplot(x="age", y="Charge", data=df, color="g", hue="smoker")

<matplotlib.axes._subplots.AxesSubplot at 0x7f6069c22150>

Overall, Charge is lower for non-smokers than for smokers
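
A grouped summary is one way to back this reading of the plot with numbers. The sketch below runs on the five preview rows only (the frame name `toy` is illustrative), so the printed figures describe those rows, not the dataset:

```python
import pandas as pd

# smoker and charges from the five preview rows, not the full dataset
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})
toy["Charge"] = toy["charges"] / 1000

# Count, mean and median of Charge for smokers and non-smokers
summary = toy.groupby("smoker")["Charge"].agg(["count", "mean", "median"])
print(summary)
```

Applied to the full `df`, the same `groupby` call would give the comparison for all 1338 records.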

## region
southwest = df[df["region"]=="southwest"]
southeast = df[df["region"]=="southeast"]
northwest = df[df["region"]=="northwest"]
northeast = df[df["region"]=="northeast"]

plt.rcParams["figure.figsize"]=(18,8)
plt.subplot(421)
sns.lineplot(x="age", y="bmi", data=southwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southwest region")

plt.subplot(422)
sns.lineplot(x="bmi", y="Charge", data=southwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southwest region")

plt.subplot(423)
sns.lineplot(x="age", y="bmi", data=southeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southeast region")

plt.subplot(424)
sns.lineplot(x="bmi", y="Charge", data=southeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("southeast region")

plt.subplot(425)
sns.lineplot(x="age", y="bmi", data=northwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northwest region")

plt.subplot(426)
sns.lineplot(x="bmi", y="Charge", data=northwest, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northwest region")

plt.subplot(427)
sns.lineplot(x="age", y="bmi", data=northeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northeast region")

plt.subplot(428)
sns.lineplot(x="bmi", y="Charge", data=northeast, hue="risk_type", hue_order=["Normal","OverWeight","Obese"])
plt.title("northeast region")

Text(0.5, 1.0, 'northeast region')
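
The eight near-identical subplot calls above can be written as a loop over regions. The following is a rough sketch under several assumptions: it uses plain matplotlib lines instead of `sns.lineplot` with `hue`, it runs on the five preview rows (which contain only three of the four regions), and the names `toy`, `regions`, `ax_left` and `ax_right` are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The five preview rows shown earlier; the notebook would use the full df
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "Charge": [16.884924, 1.7255523, 4.449462, 21.98447061, 3.8668552],
})

regions = ["southwest", "southeast", "northwest"]
# One row of axes per region: left = bmi vs age, right = Charge vs bmi
fig, axes = plt.subplots(len(regions), 2, figsize=(18, 8), squeeze=False)

for (ax_left, ax_right), region in zip(axes, regions):
    subset = toy[toy["region"] == region]

    by_age = subset.sort_values("age")
    ax_left.plot(by_age["age"], by_age["bmi"], marker="o")
    ax_left.set_title(f"{region} region: bmi vs age")

    by_bmi = subset.sort_values("bmi")
    ax_right.plot(by_bmi["bmi"], by_bmi["Charge"], marker="o")
    ax_right.set_title(f"{region} region: Charge vs bmi")

fig.tight_layout()
```

With seaborn, the two `plot` calls could be swapped for `sns.lineplot(..., ax=ax_left)` and `sns.lineplot(..., ax=ax_right)` to keep the `hue="risk_type"` split used in the post.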

plt.rcParams["figure.figsize"]=(16,6)
plt.subplot(2,2,1)
sns.violinplot(x="sex", y="bmi", data=df)  ## violin plot

plt.subplot(2,2,2)
sns.violinplot(x="smoker", y="bmi", data=df)

plt.subplot(2,2,3)
sns.violinplot(x="risk_type", y="bmi", data=df)

plt.subplot(2,2,4)
sns.violinplot(x="Age_group", y="bmi", data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x7f60692f6590>

plt.rcParams["figure.figsize"]=(16,6)
plt.subplot(2,2,1)
sns.violinplot(x="sex", y="Charge", data=df)

plt.subplot(2,2,2)
sns.violinplot(x="smoker", y="Charge", data=df)

plt.subplot(2,2,3)
sns.violinplot(x="risk_type", y="Charge", data=df)

plt.subplot(2,2,4)
sns.violinplot(x="Age_group", y="Charge", data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x7f6069128bd0>

plt.rcParams["figure.figsize"]=(28,10)
plt.subplot(2,2,1)
sns.violinplot(x="children", y="bmi", data=df, hue="smoker")

plt.rcParams["figure.figsize"]=(28,10)
plt.subplot(2,2,3)
sns.violinplot(x="children", y="Charge", data=df, hue="smoker")

<matplotlib.axes._subplots.AxesSubplot at 0x7f6069054ad0>
