| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
How many men and women are in the dataset?
data['sex'].value_counts()
Male 21790
Female 10771
Name: sex, dtype: int64
What is the average age of the women in the dataset?
data[data['sex'] == 'Female']['age'].mean()
36.85823043357163
What proportion of the people in the dataset are German citizens?
data['native-country'].value_counts(normalize=True)['Germany']
0.004207487485028101
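With `normalize=True`, `value_counts` returns relative frequencies instead of raw counts. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset) suggests it matches the manual calculation of matching rows divided by all rows:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "native-country": ["United-States", "Germany", "United-States", "Cuba"],
})

# Relative frequency of one category via value_counts
share = toy["native-country"].value_counts(normalize=True)["Germany"]

# Same quantity by hand: the mean of a boolean mask is the fraction of True rows
manual = (toy["native-country"] == "Germany").mean()

print(share, manual)
```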
What are the mean and standard deviation of age for people earning more than 50K per year and for those earning 50K or less?
salary1 = data[data['salary'] == '>50K']['age']
salary2 = data[data['salary'] == '<=50K']['age']
print(salary1.mean(),salary1.std())
print(salary2.mean(),salary2.std())
44.24984058155847 10.51902771985177
36.78373786407767 14.020088490824813
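The two filtered Series above could probably also be replaced by a single `groupby` with `agg`. A minimal sketch, assuming a small made-up frame named `toy` (not the real data):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "salary": [">50K", "<=50K", ">50K", "<=50K"],
    "age": [40, 20, 50, 30],
})

# One row per salary group, one column per statistic
# (std is the sample standard deviation, ddof=1, as with Series.std)
stats = toy.groupby("salary")["age"].agg(["mean", "std"])
print(stats)
```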
Has everyone who earns more than 50K per year completed at least a high-school education?
data[data['salary'] == ">50K"]['education'].unique()
array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
'10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
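The array above contains levels such as '9th' and '7th-8th', so the answer appears to be no. The same check can be expressed as a single boolean. This is a sketch on made-up data, and the `below_hs` list is an assumption about which labels count as below a high-school diploma:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", ">50K"],
    "education": ["Bachelors", "9th", "HS-grad", "Masters"],
})

# Assumed set of education labels below a completed high-school diploma
below_hs = ["Preschool", "1st-4th", "5th-6th", "7th-8th",
            "9th", "10th", "11th", "12th"]

high = toy[toy["salary"] == ">50K"]

# True only if no high earner has an education level in below_hs
all_above_hs = not high["education"].isin(below_hs).any()
print(all_above_hs)
```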
Use groupby and describe to summarize the age distribution for each combination of race and sex.
for (race, sex), mini_data in data.groupby(['race', 'sex']):
    print(race, sex)
    print(mini_data['age'].describe())
Amer-Indian-Eskimo Female
count 119.000000
mean 37.117647
std 13.114991
min 17.000000
25% 27.000000
50% 36.000000
75% 46.000000
max 80.000000
Name: age, dtype: float64
Amer-Indian-Eskimo Male
count 192.000000
mean 37.208333
std 12.049563
min 17.000000
25% 28.000000
50% 35.000000
75% 45.000000
max 82.000000
Name: age, dtype: float64
Asian-Pac-Islander Female
count 346.000000
mean 35.089595
std 12.300845
min 17.000000
25% 25.000000
50% 33.000000
75% 43.750000
max 75.000000
Name: age, dtype: float64
Asian-Pac-Islander Male
count 693.000000
mean 39.073593
std 12.883944
min 18.000000
25% 29.000000
50% 37.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Black Female
count 1555.000000
mean 37.854019
std 12.637197
min 17.000000
25% 28.000000
50% 37.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Black Male
count 1569.000000
mean 37.682600
std 12.882612
min 17.000000
25% 27.000000
50% 36.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
Other Female
count 109.000000
mean 31.678899
std 11.631599
min 17.000000
25% 23.000000
50% 29.000000
75% 39.000000
max 74.000000
Name: age, dtype: float64
Other Male
count 162.000000
mean 34.654321
std 11.355531
min 17.000000
25% 26.000000
50% 32.000000
75% 42.000000
max 77.000000
Name: age, dtype: float64
White Female
count 8642.000000
mean 36.811618
std 14.329093
min 17.000000
25% 25.000000
50% 35.000000
75% 46.000000
max 90.000000
Name: age, dtype: float64
White Male
count 19174.000000
mean 39.652498
std 13.436029
min 17.000000
25% 29.000000
50% 38.000000
75% 49.000000
max 90.000000
Name: age, dtype: float64
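The loop prints one `describe` block per group. Calling `describe` directly on the grouped column should give the same statistics as one table with a row per (race, sex) pair. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "race": ["White", "White", "Black", "Black"],
    "sex": ["Male", "Male", "Female", "Female"],
    "age": [30, 40, 25, 35],
})

# One row per (race, sex) group; columns are count, mean, std, min, quartiles, max
summary = toy.groupby(["race", "sex"])["age"].describe()
print(summary)
```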
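Among men, count how many high earners are married and how many are unmarried (unmarried here includes divorced and separated).
# Unmarried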
data[(data['sex'] == 'Male') &
(data['marital-status'].isin(['Never-married',
'Separated', 'Divorced']))]['salary'].value_counts()
<=50K 7423
>50K 658
Name: salary, dtype: int64
# Married
data[(data['sex'] == 'Male') &
(data['marital-status'].str.startswith('Married'))]['salary'].value_counts()
<=50K 7576
>50K 5965
Name: salary, dtype: int64
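Both counts can likely be produced in one table with `pd.crosstab`. The sketch below uses made-up data and, unlike the two filters above, lumps every status that does not start with "Married" (including widowed) into a single "unmarried" group. That grouping is an assumption made for brevity:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Male"],
    "marital-status": ["Never-married", "Married-civ-spouse", "Divorced",
                       "Married-civ-spouse", "Married-AF-spouse"],
    "salary": ["<=50K", ">50K", ">50K", ">50K", "<=50K"],
})

men = toy[toy["sex"] == "Male"]

# Label each man as married or unmarried based on the status prefix
group = men["marital-status"].str.startswith("Married").map(
    {True: "married", False: "unmarried"}
)

# Rows: marital group, columns: salary bracket, cells: number of men
table = pd.crosstab(group, men["salary"])
print(table)
```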
For each country, compute the average weekly working hours of people earning more than 50K and of those earning 50K or less.
for (country, salary), sub_df in data.groupby(['native-country', 'salary']):
    print(country, salary, round(sub_df['hours-per-week'].mean(), 2))
? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5
# Cross-tabulation
pd.crosstab(data['native-country'], data['salary'],
values=data['hours-per-week'], aggfunc=np.mean)
| salary | <=50K | >50K |
|---|---|---|
| native-country | | |
| ? | 40.164760 | 45.547945 |
| Cambodia | 41.416667 | 40.000000 |
| Canada | 37.914634 | 45.641026 |
| China | 37.381818 | 38.900000 |
| Columbia | 38.684211 | 50.000000 |
| Cuba | 37.985714 | 42.440000 |
| Dominican-Republic | 42.338235 | 47.000000 |
| Ecuador | 38.041667 | 48.750000 |
| El-Salvador | 36.030928 | 45.000000 |
| England | 40.483333 | 44.533333 |
| France | 41.058824 | 50.750000 |
| Germany | 39.139785 | 44.977273 |
| Greece | 41.809524 | 50.625000 |
| Guatemala | 39.360656 | 36.666667 |
| Haiti | 36.325000 | 42.750000 |
| Holand-Netherlands | 40.000000 | NaN |
| Honduras | 34.333333 | 60.000000 |
| Hong | 39.142857 | 45.000000 |
| Hungary | 31.300000 | 50.000000 |
| India | 38.233333 | 46.475000 |
| Iran | 41.440000 | 47.500000 |
| Ireland | 40.947368 | 48.000000 |
| Italy | 39.625000 | 45.400000 |
| Jamaica | 38.239437 | 41.100000 |
| Japan | 41.000000 | 47.958333 |
| Laos | 40.375000 | 40.000000 |
| Mexico | 40.003279 | 46.575758 |
| Nicaragua | 36.093750 | 37.500000 |
| Outlying-US(Guam-USVI-etc) | 41.857143 | NaN |
| Peru | 35.068966 | 40.000000 |
| Philippines | 38.065693 | 43.032787 |
| Poland | 38.166667 | 39.000000 |
| Portugal | 41.939394 | 41.500000 |
| Puerto-Rico | 38.470588 | 39.416667 |
| Scotland | 39.444444 | 46.666667 |
| South | 40.156250 | 51.437500 |
| Taiwan | 33.774194 | 46.800000 |
| Thailand | 42.866667 | 58.333333 |
| Trinadad&Tobago | 37.058824 | 40.000000 |
| United-States | 38.799127 | 45.505369 |
| Vietnam | 37.193548 | 39.200000 |
| Yugoslavia | 41.600000 | 49.500000 |
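The cross-tabulation and the earlier `groupby` loop compute the same group means. One way to see this is to compare `groupby(...).mean().unstack()` with `pd.crosstab` on a small made-up frame (`toy` is hypothetical, and `aggfunc="mean"` is used here in place of `np.mean`):

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "native-country": ["Cuba", "Cuba", "Japan", "Japan", "Japan"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "hours-per-week": [40, 50, 30, 50, 60],
})

# Group means, then pivot the salary level into columns
via_groupby = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack()
)

# The same table built directly with crosstab
via_crosstab = pd.crosstab(
    toy["native-country"], toy["salary"],
    values=toy["hours-per-week"], aggfunc="mean",
)

print(via_groupby)
print(via_crosstab)
```

A country with no rows in one salary bracket would show up as NaN in both tables, as with Holand-Netherlands above.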