task2

   age  workclass         fnlwgt  education  education-num  marital-status      occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country  salary
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States   <=50K
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States   <=50K
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States   <=50K
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States   <=50K
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba            <=50K
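
The cells in this notebook use a DataFrame named `data` (and the usual `pd` alias, plus `np` where a cell references it), but the export does not show how it was loaded. The following is only a minimal sketch, assuming the dataset is a comma-separated file with a header row. The inline `sample_csv` text is a hypothetical stand-in for the real file and holds just the first two rows shown above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a header plus two rows.
# With the actual dataset, pass its path to pd.read_csv instead.
sample_csv = io.StringIO(
    "age,workclass,fnlwgt,education,education-num,marital-status,occupation,"
    "relationship,race,sex,capital-gain,capital-loss,hours-per-week,"
    "native-country,salary\n"
    "39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,"
    "Not-in-family,White,Male,2174,0,40,United-States,<=50K\n"
    "50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,"
    "Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K\n"
)

data = pd.read_csv(sample_csv)

print(data.shape)   # (rows, columns)
print(data.head())  # first rows, as in the table above
```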

How many men and women are in the dataset?

data['sex'].value_counts()
Male      21790
Female    10771
Name: sex, dtype: int64

What is the average age of women in the dataset?

data[data['sex'] == 'Female']['age'].mean()
36.85823043357163

What proportion of people in the dataset are German citizens?

data['native-country'].value_counts(normalize=True)['Germany']
0.004207487485028101
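
The same proportion can be computed without building the full frequency table, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a hypothetical four-row `toy` frame, not the real data, to show that the two approaches agree.

```python
import pandas as pd

# Hypothetical toy data; only the column name matches the dataset.
toy = pd.DataFrame(
    {"native-country": ["United-States", "Germany", "United-States", "Cuba"]}
)

# Approach used above: normalized frequency table, then pick one label.
share_from_counts = toy["native-country"].value_counts(normalize=True)["Germany"]

# Alternative: the mean of a boolean mask is the fraction of True values.
share_from_mask = (toy["native-country"] == "Germany").mean()

print(share_from_counts, share_from_mask)
```

One practical difference: if the label does not occur at all, the mask version returns 0.0, while indexing the `value_counts` result raises a `KeyError`.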

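What are the mean and standard deviation of age for people earning more than 50K per year and for people earning 50K or less?

# Ages of people earning more than 50K and of people earning 50K or less
salary1 = data[data['salary'] == '>50K']['age']
salary2 = data[data['salary'] == '<=50K']['age']
# Print mean and standard deviation for each group (>50K first)
print(salary1.mean(), salary1.std())
print(salary2.mean(), salary2.std())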
44.24984058155847 10.51902771985177
36.78373786407767 14.020088490824813
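
Both statistics for both groups can also be obtained in one call by grouping on `salary` and aggregating `age`. This is a sketch on a hypothetical `toy` frame rather than the real data. Note that pandas' `std` is the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data with the same two column names.
toy = pd.DataFrame(
    {
        "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K"],
        "age": [40, 50, 20, 30, 40],
    }
)

# One row per salary group, one column per statistic.
summary = toy.groupby("salary")["age"].agg(["mean", "std"])
print(summary)
```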

Do all people earning more than 50K per year have at least a high-school education?

data[data['salary'] == ">50K"]['education'].unique()
array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)
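
The array above includes levels such as '7th-8th', '9th' and '1st-4th', so the answer is no. The check can also be made explicit instead of being read off the list. The sketch below runs on a hypothetical `toy` frame, and the set `HIGH_SCHOOL_OR_ABOVE` is an assumption about which labels count as "high school or above", not something defined in the original.

```python
import pandas as pd

# Hypothetical toy data with the same two column names.
toy = pd.DataFrame(
    {
        "salary": [">50K", ">50K", ">50K", "<=50K"],
        "education": ["Bachelors", "HS-grad", "9th", "5th-6th"],
    }
)

# Assumption: these labels are treated as "high school or above".
HIGH_SCHOOL_OR_ABOVE = {
    "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm",
    "Bachelors", "Masters", "Prof-school", "Doctorate",
}

high_earners = toy[toy["salary"] == ">50K"]

# True only if every high earner has one of the listed education levels.
all_hs_or_above = bool(high_earners["education"].isin(HIGH_SCHOOL_OR_ABOVE).all())

# Education levels of high earners that fall outside the set.
below_hs = sorted(set(high_earners["education"]) - HIGH_SCHOOL_OR_ABOVE)

print(all_hs_or_above, below_hs)
```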

Use groupby and describe to summarize the age distribution for each combination of race and sex.

for (race, sex), mini_data in data.groupby(['race', 'sex']):
    print(race, sex)
    print(mini_data['age'].describe())
Amer-Indian-Eskimo Female
count    119.000000
mean      37.117647
std       13.114991
min       17.000000
25%       27.000000
50%       36.000000
75%       46.000000
max       80.000000
Name: age, dtype: float64
Amer-Indian-Eskimo Male
count    192.000000
mean      37.208333
std       12.049563
min       17.000000
25%       28.000000
50%       35.000000
75%       45.000000
max       82.000000
Name: age, dtype: float64
Asian-Pac-Islander Female
count    346.000000
mean      35.089595
std       12.300845
min       17.000000
25%       25.000000
50%       33.000000
75%       43.750000
max       75.000000
Name: age, dtype: float64
Asian-Pac-Islander Male
count    693.000000
mean      39.073593
std       12.883944
min       18.000000
25%       29.000000
50%       37.000000
75%       46.000000
max       90.000000
Name: age, dtype: float64
Black Female
count    1555.000000
mean       37.854019
std        12.637197
min        17.000000
25%        28.000000
50%        37.000000
75%        46.000000
max        90.000000
Name: age, dtype: float64
Black Male
count    1569.000000
mean       37.682600
std        12.882612
min        17.000000
25%        27.000000
50%        36.000000
75%        46.000000
max        90.000000
Name: age, dtype: float64
Other Female
count    109.000000
mean      31.678899
std       11.631599
min       17.000000
25%       23.000000
50%       29.000000
75%       39.000000
max       74.000000
Name: age, dtype: float64
Other Male
count    162.000000
mean      34.654321
std       11.355531
min       17.000000
25%       26.000000
50%       32.000000
75%       42.000000
max       77.000000
Name: age, dtype: float64
White Female
count    8642.000000
mean       36.811618
std        14.329093
min        17.000000
25%        25.000000
50%        35.000000
75%        46.000000
max        90.000000
Name: age, dtype: float64
White Male
count    19174.000000
mean        39.652498
std         13.436029
min         17.000000
25%         29.000000
50%         38.000000
75%         49.000000
max         90.000000
Name: age, dtype: float64
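
The loop above prints one `describe()` block per group. Calling `describe()` directly on the grouped column instead returns a single table with one row per (race, sex) pair, which is easier to compare across groups. This is a sketch on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy data with the same three column names.
toy = pd.DataFrame(
    {
        "race": ["White", "White", "White", "Black", "Black"],
        "sex": ["Male", "Male", "Female", "Female", "Female"],
        "age": [30, 50, 25, 40, 60],
    }
)

# One row per (race, sex) group; columns are count, mean, std, min,
# the three quartiles, and max.
age_stats = toy.groupby(["race", "sex"])["age"].describe()
print(age_stats)
```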

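Count how many high-earning men are married and how many are unmarried (unmarried includes divorced and separated).

# Unmarried: never married, separated, or divorced ('Widowed' falls in neither group)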
data[(data['sex'] == 'Male') &
     (data['marital-status'].isin(['Never-married',
                                   'Separated', 'Divorced']))]['salary'].value_counts()
<=50K    7423
>50K      658
Name: salary, dtype: int64
# Married: any marital-status value starting with 'Married'
data[(data['sex'] == 'Male') &
     (data['marital-status'].str.startswith('Married'))]['salary'].value_counts()
<=50K    7576
>50K     5965
Name: salary, dtype: int64
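
The two filters above can be combined into one cross-tabulation, so the married and unmarried counts sit side by side. The sketch below uses a hypothetical `toy` frame and the same grouping rules as the cells above. The names `unmarried_labels`, `labeled` and `group` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same three column names.
toy = pd.DataFrame(
    {
        "sex": ["Male"] * 6 + ["Female"],
        "marital-status": [
            "Married-civ-spouse", "Married-civ-spouse", "Married-AF-spouse",
            "Never-married", "Divorced", "Widowed", "Married-civ-spouse",
        ],
        "salary": [">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
    }
)

men = toy[toy["sex"] == "Male"]
unmarried_labels = ["Never-married", "Separated", "Divorced"]

is_married = men["marital-status"].str.startswith("Married")
is_unmarried = men["marital-status"].isin(unmarried_labels)

# Keep only men covered by one of the two rules (this drops 'Widowed').
labeled = men[is_married | is_unmarried]
group = (
    is_married.loc[labeled.index]
    .map({True: "married", False: "unmarried"})
    .rename("group")
)

# Rows: married / unmarried; columns: salary bracket; cells: counts.
counts = pd.crosstab(group, labeled["salary"])
print(counts)
```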

For each country, compute the average weekly working hours of people earning more than 50K and of people earning 50K or less.

for (country, salary), sub_df in data.groupby(['native-country', 'salary']):
    print(country, salary, round(sub_df['hours-per-week'].mean(), 2))
? <=50K 40.16
? >50K 45.55
Cambodia <=50K 41.42
Cambodia >50K 40.0
Canada <=50K 37.91
Canada >50K 45.64
China <=50K 37.38
China >50K 38.9
Columbia <=50K 38.68
Columbia >50K 50.0
Cuba <=50K 37.99
Cuba >50K 42.44
Dominican-Republic <=50K 42.34
Dominican-Republic >50K 47.0
Ecuador <=50K 38.04
Ecuador >50K 48.75
El-Salvador <=50K 36.03
El-Salvador >50K 45.0
England <=50K 40.48
England >50K 44.53
France <=50K 41.06
France >50K 50.75
Germany <=50K 39.14
Germany >50K 44.98
Greece <=50K 41.81
Greece >50K 50.62
Guatemala <=50K 39.36
Guatemala >50K 36.67
Haiti <=50K 36.33
Haiti >50K 42.75
Holand-Netherlands <=50K 40.0
Honduras <=50K 34.33
Honduras >50K 60.0
Hong <=50K 39.14
Hong >50K 45.0
Hungary <=50K 31.3
Hungary >50K 50.0
India <=50K 38.23
India >50K 46.48
Iran <=50K 41.44
Iran >50K 47.5
Ireland <=50K 40.95
Ireland >50K 48.0
Italy <=50K 39.62
Italy >50K 45.4
Jamaica <=50K 38.24
Jamaica >50K 41.1
Japan <=50K 41.0
Japan >50K 47.96
Laos <=50K 40.38
Laos >50K 40.0
Mexico <=50K 40.0
Mexico >50K 46.58
Nicaragua <=50K 36.09
Nicaragua >50K 37.5
Outlying-US(Guam-USVI-etc) <=50K 41.86
Peru <=50K 35.07
Peru >50K 40.0
Philippines <=50K 38.07
Philippines >50K 43.03
Poland <=50K 38.17
Poland >50K 39.0
Portugal <=50K 41.94
Portugal >50K 41.5
Puerto-Rico <=50K 38.47
Puerto-Rico >50K 39.42
Scotland <=50K 39.44
Scotland >50K 46.67
South <=50K 40.16
South >50K 51.44
Taiwan <=50K 33.77
Taiwan >50K 46.8
Thailand <=50K 42.87
Thailand >50K 58.33
Trinadad&Tobago <=50K 37.06
Trinadad&Tobago >50K 40.0
United-States <=50K 38.8
United-States >50K 45.51
Vietnam <=50K 37.19
Vietnam >50K 39.2
Yugoslavia <=50K 41.6
Yugoslavia >50K 49.5
# Cross-tabulation: the same averages as one table (countries as rows, salary brackets as columns)
pd.crosstab(data['native-country'], data['salary'],
            values=data['hours-per-week'], aggfunc='mean')
salary                          <=50K       >50K
native-country
?                           40.164760  45.547945
Cambodia                    41.416667  40.000000
Canada                      37.914634  45.641026
China                       37.381818  38.900000
Columbia                    38.684211  50.000000
Cuba                        37.985714  42.440000
Dominican-Republic          42.338235  47.000000
Ecuador                     38.041667  48.750000
El-Salvador                 36.030928  45.000000
England                     40.483333  44.533333
France                      41.058824  50.750000
Germany                     39.139785  44.977273
Greece                      41.809524  50.625000
Guatemala                   39.360656  36.666667
Haiti                       36.325000  42.750000
Holand-Netherlands          40.000000        NaN
Honduras                    34.333333  60.000000
Hong                        39.142857  45.000000
Hungary                     31.300000  50.000000
India                       38.233333  46.475000
Iran                        41.440000  47.500000
Ireland                     40.947368  48.000000
Italy                       39.625000  45.400000
Jamaica                     38.239437  41.100000
Japan                       41.000000  47.958333
Laos                        40.375000  40.000000
Mexico                      40.003279  46.575758
Nicaragua                   36.093750  37.500000
Outlying-US(Guam-USVI-etc)  41.857143        NaN
Peru                        35.068966  40.000000
Philippines                 38.065693  43.032787
Poland                      38.166667  39.000000
Portugal                    41.939394  41.500000
Puerto-Rico                 38.470588  39.416667
Scotland                    39.444444  46.666667
South                       40.156250  51.437500
Taiwan                      33.774194  46.800000
Thailand                    42.866667  58.333333
Trinadad&Tobago             37.058824  40.000000
United-States               38.799127  45.505369
Vietnam                     37.193548  39.200000
Yugoslavia                  41.600000  49.500000
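
The loop and the cross-tabulation compute the same group means. A third way to get the wide table is `groupby` followed by `unstack`. The sketch below, on a hypothetical `toy` frame, builds the table both ways for comparison. In either form, a country with no rows in one salary bracket gets NaN in that cell, as in the Holand-Netherlands row above.

```python
import pandas as pd

# Hypothetical toy data with the same three column names.
toy = pd.DataFrame(
    {
        "native-country": ["Cuba", "Cuba", "Cuba", "Japan", "Japan"],
        "salary": ["<=50K", "<=50K", ">50K", "<=50K", ">50K"],
        "hours-per-week": [30, 40, 50, 40, 60],
    }
)

# Group means as a long Series, then pivot the salary level into columns.
by_group = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack("salary")
)

# The same table built directly with crosstab.
by_crosstab = pd.crosstab(
    toy["native-country"], toy["salary"],
    values=toy["hours-per-week"], aggfunc="mean",
)

print(by_group)
print(by_crosstab)
```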