
Python | missingno

A library for visualizing missing values.

missingno

Installation:

pip install missingno

First, download the sample data:

pip install quilt
quilt install ResidentMario/missingno_data

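Load the data:

import numpy as np
import missingno as msno  # used by every plotting call below
from quilt.data.ResidentMario import missingno_data

collisions = missingno_data.nyc_collision_factors()
# Turn the literal string 'nan' into a real missing value
collisions = collisions.replace('nan', np.nan)
collisions.head()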
Output of collisions.head(), transposed so that each DataFrame column is one row:

| Column | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
| DATE | 11/10/2016 | 11/10/2016 | 04/16/2016 | 04/15/2016 | 04/15/2016 |
| TIME | 16:11:00 | 05:11:00 | 09:15:00 | 10:20:00 | 10:35:00 |
| BOROUGH | BROOKLYN | MANHATTAN | BROOKLYN | QUEENS | BROOKLYN |
| ZIP CODE | 11208.0 | 10013.0 | 11201.0 | 11375.0 | 11210.0 |
| LATITUDE | 40.662514 | 40.721323 | 40.687999 | 40.719228 | 40.632147 |
| LONGITUDE | -73.872007 | -74.008344 | -73.997563 | -73.854542 | -73.952731 |
| LOCATION | (40.6625139, -73.8720068) | (40.7213228, -74.0083444) | (40.6879989, -73.9975625) | (40.7192276, -73.8545422) | (40.6321467, -73.9527315) |
| ON STREET NAME | WORTMAN AVENUE | HUBERT STREET | HENRY STREET | NaN | BEDFORD AVENUE |
| CROSS STREET NAME | MONTAUK AVENUE | HUDSON STREET | WARREN STREET | NaN | CAMPUS ROAD |
| OFF STREET NAME | NaN | NaN | NaN | 67-64 FLEET STREET | NaN |
| NUMBER OF PERSONS INJURED | 0 | 1 | 0 | 0 | 0 |
| NUMBER OF PERSONS KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF PEDESTRIANS INJURED | 0 | 1 | 0 | 0 | 0 |
| NUMBER OF PEDESTRIANS KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF CYCLISTS INJURED | NaN | NaN | NaN | NaN | NaN |
| NUMBER OF CYCLISTS KILLED | NaN | NaN | NaN | NaN | NaN |
| CONTRIBUTING FACTOR VEHICLE 1 | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way | Lost Consciousness | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way |
| CONTRIBUTING FACTOR VEHICLE 2 | Unspecified | NaN | Lost Consciousness | Failure to Yield Right-of-Way | Failure to Yield Right-of-Way |
| CONTRIBUTING FACTOR VEHICLE 3 | NaN | NaN | NaN | Failure to Yield Right-of-Way | NaN |
| CONTRIBUTING FACTOR VEHICLE 4 | NaN | NaN | NaN | NaN | NaN |
| CONTRIBUTING FACTOR VEHICLE 5 | NaN | NaN | NaN | NaN | NaN |
| VEHICLE TYPE CODE 1 | TAXI | PASSENGER VEHICLE | PASSENGER VEHICLE | PASSENGER VEHICLE | PASSENGER VEHICLE |
| VEHICLE TYPE CODE 2 | PASSENGER VEHICLE | NaN | VAN | PASSENGER VEHICLE | PASSENGER VEHICLE |
| VEHICLE TYPE CODE 3 | NaN | NaN | NaN | PASSENGER VEHICLE | NaN |
| VEHICLE TYPE CODE 4 | NaN | NaN | NaN | NaN | NaN |
| VEHICLE TYPE CODE 5 | NaN | NaN | NaN | NaN | NaN |
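
The replace('nan', np.nan) step in the loading code matters because pandas treats the string 'nan' as ordinary text, not as a missing value. A minimal sketch of the difference, using a small made-up frame (the `toy` data below is hypothetical, not part of the collisions dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the string 'nan' stands in for a missing borough.
toy = pd.DataFrame({"BOROUGH": ["BROOKLYN", "nan", "QUEENS"]})

# Before the replacement, pandas sees an ordinary string, so nothing is null.
missing_before = int(toy["BOROUGH"].isnull().sum())

# After the replacement, the placeholder is a real NaN and is counted as null.
cleaned = toy.replace("nan", np.nan)
missing_after = int(cleaned["BOROUGH"].isnull().sum())

print(missing_before, missing_after)
```

Without this step, the plots below would likely show such cells as present rather than missing.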

bar()

msno.bar(collisions.sample(1000))
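
The bar chart summarizes how complete each column is. The same numbers can be computed directly with pandas; the sketch below uses a hypothetical four-row frame (`toy`), not the real collisions data, and only approximates what the chart displays:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two partially missing columns.
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", np.nan, "QUEENS", np.nan],
    "ZIP CODE": [11208.0, np.nan, 11375.0, np.nan],
    "DATE": ["11/10/2016", "11/10/2016", "04/16/2016", "04/15/2016"],
})

# Number of non-null entries per column.
non_null_counts = toy.notnull().sum()

# Fraction of rows that are present in each column.
completeness = non_null_counts / len(toy)

print(non_null_counts.to_dict())
print(completeness.to_dict())
```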

dendrogram()

Dendrogram (hierarchical clustering tree)

msno.dendrogram(collisions)
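  • NUMBER OF CYCLISTS INJURED, NUMBER OF CYCLISTS KILLED, CONTRIBUTING FACTOR VEHICLE 1, NUMBER OF PEDESTRIANS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PERSONS KILLED and similar columns are complete, with no missing values. The distance between them is zero, so they are clustered together.
  • BOROUGH and ZIP CODE have a nullity correlation of 1 (they are missing together), so the distance between them is zero. They also have the fewest missing values apart from the complete columns, so they join the cluster right after the complete columns.
  • ……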
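
The "distance zero" reasoning in the points above can be checked on a small example. This is only a sketch: it assumes Euclidean distance between 0/1 nullity vectors as the metric (the library's actual metric and linkage may differ), and both the `toy` frame and the `nullity_distance` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two complete columns, two columns that are
# missing together, and one column that is mostly missing.
toy = pd.DataFrame({
    "DATE": ["a", "b", "c", "d"],
    "TIME": ["a", "b", "c", "d"],
    "BOROUGH": ["x", np.nan, "y", "z"],
    "ZIP CODE": [1.0, np.nan, 2.0, 3.0],
    "OFF STREET NAME": [np.nan, np.nan, "s", np.nan],
})

# One 0/1 vector per column: 1 where the value is missing.
nullity = toy.isnull().astype(int)


def nullity_distance(a, b):
    """Euclidean distance between the nullity vectors of columns a and b."""
    return float(np.linalg.norm(nullity[a].to_numpy() - nullity[b].to_numpy()))


d_complete = nullity_distance("DATE", "TIME")             # both complete
d_together = nullity_distance("BOROUGH", "ZIP CODE")      # missing together
d_near = nullity_distance("DATE", "BOROUGH")              # few missing values
d_far = nullity_distance("DATE", "OFF STREET NAME")       # many missing values

print(d_complete, d_together, d_near, d_far)
```

Columns with identical missingness patterns end up at distance zero, and a column with few missing values sits closer to the complete columns than one with many, which matches the order in which the tree merges them.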

heatmap()

Heatmap
Shows how strongly the presence or absence of one variable affects the presence of another.

msno.heatmap(collisions)

ZIP CODE and BOROUGH have a nullity correlation of 1, which means that whenever BOROUGH is missing, ZIP CODE is missing as well.
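
A nullity correlation is the correlation between the "is missing" indicators of two columns. The sketch below computes it with pandas on a hypothetical six-row frame (`toy`); it illustrates the idea and is not a copy of the library's internals.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: BOROUGH and ZIP CODE are always missing together,
# while OFF STREET NAME follows a different pattern.
toy = pd.DataFrame({
    "BOROUGH": ["x", np.nan, "y", np.nan, "z", "w"],
    "ZIP CODE": [1.0, np.nan, 2.0, np.nan, 3.0, 4.0],
    "OFF STREET NAME": [np.nan, "s", np.nan, np.nan, np.nan, "t"],
})

# Pearson correlation between the 0/1 missing indicators of each column pair.
nullity_corr = toy.isnull().astype(int).corr()

corr_borough_zip = float(nullity_corr.loc["BOROUGH", "ZIP CODE"])
corr_borough_off = float(nullity_corr.loc["BOROUGH", "OFF STREET NAME"])

print(nullity_corr.round(2))
```

A value of 1 means the two columns are always missing in the same rows; a negative value means one tends to be present where the other is missing.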

matrix()

import missingno as msno
%matplotlib inline

msno.matrix(collisions.sample(250))

White areas indicate missing values.

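import pandas as pd

# Build a random 50x20 pattern; False cells are replaced with None (missing)
null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
# With a time-series index, freq sets the periodicity of the row labels
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')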
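
The set_index call above only works because the monthly period range has exactly as many entries as the frame has rows (50). A quick check of that assumption, using pandas alone (the variable names are illustrative):

```python
import pandas as pd

# Monthly periods from January 2011 through February 2015, inclusive.
index = pd.period_range('1/1/2011', '2/1/2015', freq='M')

n_periods = len(index)
first_period = str(index[0])
last_period = str(index[-1])

print(n_periods, first_period, last_period)
```

If the row count of the data changes, the date range has to change with it, otherwise set_index raises a length mismatch error.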