Data description
- Data source: Kaggle
- Visualization code source: EDA l Data Visualization
- Implemented in a Kaggle online Jupyter Notebook
Loading the data
    # This Python 3 environment comes with many helpful analytics libraries installed
['insurance']
    import numpy as np
    import pandas as pd
    df = pd.read_csv("../input/insurance/insurance.csv")  ## load the data
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
    print(df.shape)  ## check the dimensions of the data
(1338, 7)
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010
Total number of NULL value in the dataset: age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
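The summary table and the null counts above look like the output of `describe()` and `isnull().sum()`. The cells that produced them are not shown, so the following is only a sketch of how such output could be obtained. To stay self-contained it rebuilds the five preview rows by hand instead of reading `insurance.csv`, so its numbers will differ from the full-dataset figures above.

```python
import pandas as pd

# The five preview rows shown earlier; on Kaggle, df would instead come from
# pd.read_csv("../input/insurance/insurance.csv").
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

summary = df.describe()          # count, mean, std, min, quartiles, max of numeric columns
null_counts = df.isnull().sum()  # number of missing values in each column

print(summary)
print("Total number of NULL value in the dataset:", null_counts)
```

Only the numeric columns (`age`, `bmi`, `children`, `charges`) appear in the summary, which matches the four columns in the table above.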
Creating new features
BMI:
- Normal: bmi <= 24
- OverWeight: 24 < bmi < 30
- Obese: bmi >= 30
    ## split by bmi
| | age | sex | bmi | children | smoker | region | charges | risk_type | Age_group |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 | OverWeight | Age below 25 year |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 | Obese | Age below 25 year |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 | Obese | Age 25 to 34 year |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | Normal | Age 25 to 34 year |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 | OverWeight | Age 25 to 34 year |
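The cell that builds `risk_type` is truncated in this post, so here is one possible way to apply the BMI thresholds listed above. It is a sketch using `np.select`, which is an assumption on my part and not necessarily what the original notebook did. It uses only the five `bmi` values from the preview.

```python
import numpy as np
import pandas as pd

# bmi values of the five preview rows.
df = pd.DataFrame({"bmi": [27.900, 33.770, 33.000, 22.705, 28.880]})

# Conditions are checked in order, so the second one only applies to rows
# that failed the first (that is, 24 < bmi < 30).
conditions = [df["bmi"] <= 24, df["bmi"] < 30]
choices = ["Normal", "OverWeight"]

# Anything matching neither condition has bmi >= 30 and is labelled Obese.
df["risk_type"] = np.select(conditions, choices, default="Obese")

print(df)
```

The resulting labels agree with the `risk_type` column in the table above for these five rows.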
The grouping can also be done with the `cut()` function from the pandas module; see the Pandas module post for details.
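As a rough illustration of the `cut()` approach, the sketch below bins `age` into groups. Only the first two labels ("Age below 25 year" and "Age 25 to 34 year") appear in the table above. The bin edges and the remaining three labels are my assumptions, chosen just to make the example complete.

```python
import numpy as np
import pandas as pd

# age values of the five preview rows.
df = pd.DataFrame({"age": [19, 18, 28, 33, 32]})

# Assumed bin edges; right=False makes each bin include its left edge,
# e.g. [25, 35) covers ages 25 to 34.
bins = [0, 25, 35, 45, 55, np.inf]
labels = [
    "Age below 25 year",      # label seen in the table
    "Age 25 to 34 year",      # label seen in the table
    "Age 35 to 44 year",      # hypothetical label
    "Age 45 to 54 year",      # hypothetical label
    "Age 55 year and above",  # hypothetical label
]

df["Age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df)
```

Note that `cut()` closes every bin on the same side, so the BMI rule above (`<= 24` for Normal but `>= 30` for Obese) cannot be reproduced exactly with a single call.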
    ## the grouping can also be done as follows
    df["Charge"] = df["charges"] / 1000
Data visualization
    ## pair plot
<seaborn.axisgrid.PairGrid at 0x7f6074b5ced0>
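- The diagonal panels show a histogram of each feature's sample values
- The off-diagonal panels show pairwise scatter plots, with the two features on the x and y axes
- `Charge` is `charges` divided by 1000, so the two are perfectly positively correlated and their panel is a straight diagonal line
- `bmi` and `age` show no clear correlation
- Across the different `charges` ranges, `charges` and `Charge` are positively correlated with `age`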
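The point about `Charge` and `charges` can be checked numerically: dividing by a positive constant does not change the Pearson correlation, so the coefficient is exactly 1 up to floating-point error. The sketch below uses only the five preview rows, so the other entries of the matrix say nothing reliable about the full dataset.

```python
import pandas as pd

# Numeric columns of the five preview rows.
df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})
df["Charge"] = df["charges"] / 1000  # same rescaling as in the post

# Pairwise Pearson correlation matrix.
corr = df.corr()

print(corr.round(3))
```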
    plt.rcParams["figure.figsize"] = (20, 14)
Text(0.5, 1.0, 'Age_group distribution in the Data')
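- Ages in the sample are concentrated mainly in the "under 20" range
- Obese samples are relatively numerous
- Smokers are relatively few, and non-smokers make up most of the sample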
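Observations like these come from counting categories. The sketch below shows the counting step with `value_counts()` on the five preview rows only, so the counts illustrate the mechanics and are not evidence for the claims above, which refer to all 1338 rows.

```python
import pandas as pd

# Categorical columns of the five preview rows (risk_type as shown in the table).
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "risk_type": ["OverWeight", "Obese", "Obese", "Normal", "OverWeight"],
})

smoker_counts = df["smoker"].value_counts()   # rows per smoker category
risk_counts = df["risk_type"].value_counts()  # rows per BMI category

print(smoker_counts)
print(risk_counts)
```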
    plt.rcParams['figure.figsize'] = (18, 8)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069f93d90>
- `bmi` tends to rise as `age` rises
- `charges` tends to rise as `age` rises
Next, line plots of `bmi` and `charges` against `age` are drawn by age group:
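The plotting cells are truncated here, so the following is only a sketch of one way to draw such lines: average `bmi` and `charges` within each `Age_group`, then plot the means. It uses plain matplotlib, which may differ from the plotting call in the original notebook, and only the five preview rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Preview rows with the Age_group labels shown in the table.
df = pd.DataFrame({
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
    "Age_group": [
        "Age below 25 year", "Age below 25 year",
        "Age 25 to 34 year", "Age 25 to 34 year", "Age 25 to 34 year",
    ],
})

# sort=False keeps groups in order of first appearance (youngest group first here)
# instead of sorting the labels alphabetically.
grouped = df.groupby("Age_group", sort=False)[["bmi", "charges"]].mean()

plt.rcParams["figure.figsize"] = (18, 8)
fig, axes = plt.subplots(1, 2)
for ax, column in zip(axes, ["bmi", "charges"]):
    ax.plot(grouped.index, grouped[column], marker="o")
    ax.set_xlabel("Age_group")
    ax.set_ylabel(f"mean {column}")
    ax.set_title(f"Mean {column} by Age_group")
```

In a notebook the figure is displayed automatically; elsewhere, `fig.savefig(...)` would write it to a file.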
    plt.rcParams['figure.figsize'] = (18, 8)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069d0fad0>
    plt.rcParams["figure.figsize"] = (18, 8)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069c22150>
- The `Charge` of non-smokers is lower overall than that of smokers
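A quick numeric counterpart to this plot is the mean `Charge` per smoker category. The sketch below computes it with `groupby` on the five preview rows. With a single smoker in that sample it only demonstrates the calculation; the observation itself rests on the full dataset.

```python
import pandas as pd

# smoker and charges columns of the five preview rows.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})
df["Charge"] = df["charges"] / 1000  # same rescaling as in the post

# Mean Charge (in thousands) for smokers and non-smokers.
mean_charge = df.groupby("smoker")["Charge"].mean()

print(mean_charge)
```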
    ## region
Text(0.5, 1.0, 'northeast region')
    plt.rcParams["figure.figsize"] = (16, 6)
<matplotlib.axes._subplots.AxesSubplot at 0x7f60692f6590>
    plt.rcParams["figure.figsize"] = (16, 6)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069128bd0>
    plt.rcParams["figure.figsize"] = (28, 10)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6069054ad0>