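missingno
A library for visualizing missing values.
Installation: missingno itself can be installed with pip (pip install missingno).
First, download the sample data: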
pip install quilt
quilt install ResidentMario/missingno_data
Load the data:
import numpy as np
from quilt.data.ResidentMario import missingno_data

# Load the NYC collisions sample as a DataFrame
collisions = missingno_data.nyc_collision_factors()
# Turn the literal string 'nan' into a real missing value
collisions = collisions.replace('nan', np.nan)
collisions.head()
Output of collisions.head(), transposed so that each dataset column is one row:

| Column | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
| DATE | 11/10/2016 | 11/10/2016 | 04/16/2016 | 04/15/2016 | 04/15/2016 |
| TIME | 16:11:00 | 05:11:00 | 09:15:00 | 10:20:00 | 10:35:00 |
| BOROUGH | BROOKLYN | MANHATTAN | BROOKLYN | QUEENS | BROOKLYN |
| ZIP CODE | 11208.0 | 10013.0 | 11201.0 | 11375.0 | 11210.0 |
| LATITUDE | 40.662514 | 40.721323 | 40.687999 | 40.719228 | 40.632147 |
| LONGITUDE | -73.872007 | -74.008344 | -73.997563 | -73.854542 | -73.952731 |
| LOCATION | (40.6625139, -73.8720068) | (40.7213228, -74.0083444) | (40.6879989, -73.9975625) | (40.7192276, -73.8545422) | (40.6321467, -73.9527315) |
| ON STREET NAME | WORTMAN AVENUE | HUBERT STREET | HENRY STREET | NaN | BEDFORD AVENUE |
| CROSS STREET NAME | MONTAUK AVENUE | HUDSON STREET | WARREN STREET | NaN | CAMPUS ROAD |
| OFF STREET NAME | NaN | NaN | NaN | 67-64 FLEET STREET | NaN |
| NUMBER OF PERSONS INJURED | 0 | 1 | 0 | 0 | 0 |
| NUMBER OF PERSONS KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF PEDESTRIANS INJURED | 0 | 1 | 0 | 0 | 0 |
| NUMBER OF PEDESTRIANS KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF CYCLISTS INJURED | NaN | NaN | NaN | NaN | NaN |
| NUMBER OF CYCLISTS KILLED | NaN | NaN | NaN | NaN | NaN |
| CONTRIBUTING FACTOR VEHICLE 1 | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way | Lost Consciousness | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way |
| CONTRIBUTING FACTOR VEHICLE 2 | Unspecified | NaN | Lost Consciousness | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way |
| CONTRIBUTING FACTOR VEHICLE 3 | NaN | NaN | NaN | Failure to Yield Right-of-Way | NaN |
| CONTRIBUTING FACTOR VEHICLE 4 | NaN | NaN | NaN | NaN | NaN |
| CONTRIBUTING FACTOR VEHICLE 5 | NaN | NaN | NaN | NaN | NaN |
| VEHICLE TYPE CODE 1 | TAXI | PASSENGER VEHICLE | PASSENGER VEHICLE | PASSENGER VEHICLE | PASSENGER VEHICLE |
| VEHICLE TYPE CODE 2 | PASSENGER VEHICLE | NaN | VAN | PASSENGER VEHICLE | PASSENGER VEHICLE |
| VEHICLE TYPE CODE 3 | NaN | NaN | NaN | PASSENGER VEHICLE | NaN |
| VEHICLE TYPE CODE 4 | NaN | NaN | NaN | NaN | NaN |
| VEHICLE TYPE CODE 5 | NaN | NaN | NaN | NaN | NaN |
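The replace('nan', np.nan) step in the loading code matters because pandas does not treat the literal string 'nan' as missing. Here is a minimal sketch of the effect on a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative data): the string 'nan' stands in for a missing borough.
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", "nan", "QUEENS"],
    "ZIP CODE": [11208.0, np.nan, 11375.0],
})

# Before the replacement only the real NaN in ZIP CODE is detected.
missing_before = toy.isnull().sum()

# After the replacement the former 'nan' string is counted as missing too.
cleaned = toy.replace("nan", np.nan)
missing_after = cleaned.isnull().sum()

print("before:", missing_before.to_dict())
print("after: ", missing_after.to_dict())
```

Without this step, the plots in the following sections would presumably show such cells as present rather than missing.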
bar()
import missingno as msno
msno.bar(collisions.sample(1000))
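The bar chart has one bar per column, and the bar height reflects how many values are present. The same numbers can be read directly with pandas. The sketch below uses a small made-up frame rather than the collisions data, and the names toy, present and completeness are chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative data) with different amounts of missingness per column.
toy = pd.DataFrame({
    "DATE": ["11/10/2016", "11/10/2016", "04/16/2016", "04/15/2016"],
    "BOROUGH": ["BROOKLYN", np.nan, "BROOKLYN", np.nan],
    "OFF STREET NAME": [np.nan, np.nan, np.nan, "67-64 FLEET STREET"],
})

present = toy.count()              # number of non-null values per column
completeness = present / len(toy)  # fraction of rows that are filled in

print(pd.DataFrame({"present": present, "completeness": completeness}))
```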
dendrogram()
Dendrogram (hierarchical clustering tree)
msno.dendrogram(collisions)
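Columns such as NUMBER OF CYCLISTS INJURED, NUMBER OF CYCLISTS KILLED, CONTRIBUTING FACTOR VEHICLE 1, NUMBER OF PEDESTRIANS KILLED, NUMBER OF PEDESTRIANS INJURED and NUMBER OF PERSONS KILLED are complete, with no missing values. The distance between them is zero, so they are merged into a single cluster.
BOROUGH and ZIP CODE have a nullity correlation of 1 (they are always missing together), so the distance between them is also zero. Apart from the complete columns they have the fewest missing values, so they form the next cluster after the complete columns.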
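The "distance zero" statements above can be checked by hand. The sketch below turns each column into a 0/1 nullity indicator and measures the distance between two indicators. The toy data and the helper nullity_distance are hypothetical, and Euclidean distance is an assumption made for illustration, not necessarily the exact metric missingno uses internally.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative data): two complete columns, one pair that is
# always missing together, and one column with a different pattern.
toy = pd.DataFrame({
    "DATE": ["a", "b", "c", "d", "e"],
    "TIME": ["f", "g", "h", "i", "j"],
    "BOROUGH": ["BROOKLYN", np.nan, "QUEENS", np.nan, "BRONX"],
    "ZIP CODE": [11208.0, np.nan, 11375.0, np.nan, 10451.0],
    "OFF STREET NAME": [np.nan, np.nan, np.nan, "x", np.nan],
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isnull().astype(int)

def nullity_distance(a, b):
    """Euclidean distance between the nullity indicators of columns a and b."""
    diff = nullity[a].to_numpy() - nullity[b].to_numpy()
    return float(np.linalg.norm(diff))

d_complete = nullity_distance("DATE", "TIME")            # both complete
d_pair = nullity_distance("BOROUGH", "ZIP CODE")         # missing together
d_other = nullity_distance("BOROUGH", "OFF STREET NAME") # different patterns

print(d_complete, d_pair, d_other)
```

Columns with identical nullity patterns, whether fully complete or missing in exactly the same rows, end up at distance zero, which is why they are joined first in the tree.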
……
heatmap()
Heatmap
It shows how strongly the presence or absence of one variable affects the presence of another.
msno.heatmap(collisions)
ZIP CODE and BOROUGH have a nullity correlation of 1, which means that whenever BOROUGH is missing, ZIP CODE is missing as well.
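The nullity correlation can be approximated by correlating the missing-value indicators of the columns. This is a minimal sketch on made-up data, assuming a Pearson correlation of the isnull() mask; columns that are completely filled (or completely empty) have no variance and are left out here:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative data, not the real collisions rows).
toy = pd.DataFrame({
    "DATE": ["d1", "d2", "d3", "d4", "d5", "d6"],
    "BOROUGH": ["BROOKLYN", np.nan, "QUEENS", np.nan, "BRONX", "QUEENS"],
    "ZIP CODE": [11208.0, np.nan, 11375.0, np.nan, 10451.0, 11375.0],
    "ON STREET NAME": ["A", "B", np.nan, "C", np.nan, np.nan],
    "OFF STREET NAME": [np.nan, np.nan, "X", np.nan, "Y", "Z"],
})

# 1 where a value is missing, 0 where it is present.
mask = toy.isnull().astype(int)

# Keep only columns whose nullity actually varies; a constant indicator
# has no defined correlation.
varying = mask.loc[:, mask.var() > 0]

nullity_corr = varying.corr()
print(nullity_corr.round(2))
```

In this toy data BOROUGH and ZIP CODE correlate at 1 (missing together), while ON STREET NAME and OFF STREET NAME correlate at -1 (exactly one of the two is present in every row).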
matrix()
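import missingno as msno
# Jupyter/IPython magic: render figures inline in the notebook
%matplotlib inline

# Plot the nullity matrix for a random sample of 250 rows
msno.matrix(collisions.sample(250))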
White cells mark missing values.
import pandas as pd

# Random 50 x 20 boolean pattern; False entries become missing values
null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
# Index the rows by month (50 periods) and label the axis by business quarter
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')