Python
[pandas] group으로 묶기
Jueun Park
2020. 9. 24. 12:53
Score
Group으로 묶어서 특정 과목 점수가 높은 학생 출력하기
import pandas as pd
# DataFrame 생성
score = pd.DataFrame({
"Name": ["Jueun", "Jieun", "Terry", "Mits"],
"Gender": ["girl", "girl", "boy", "boy"],
"Class": [1, 2, 2, 1],
"Total": [0, 0, 0, 0],
"Eng": [50, 90, 40, 70],
"kor": [30, 70, 90, 100]
})
# Name을 score의 인덱스로 설정해줌
score = score.set_index(["Name"])
# Total column을 채워줌(kor + Eng = Total)
for index in range(4):
print("====={}=====".format(index))
score.iloc[index, 2] = int(score.iloc[index, 3:].sum())
print(score.iloc[index])
# Group 짓기(Class를 기준으로 묶기)
grouped = score.groupby("Class")
# 각 Class에서 Eng 과목 점수가 제일 높은 사람 출력하기
for group_name, group_data in grouped:
eng_king = score[(score["Class"]==group_name) & (score["Eng"]==group_data["Eng"].max())]
print(eng_king)
print("=" * 100)
#출력 결과:
Gender Class Total Eng Kor
Name
Mits boy 1 170.0 70 100
==================================================================================
Gender Class Total Eng Kor
Name
Jieun girl 2 160.0 90 70
==================================================================================
Titanic
생존자의 비율을 Class별로 알아보기
import pandas as pd
# 시각화 패키지
import seaborn as sns
titanic = sns.load_dataset('titanic')
# 모든 row에 대한 column 선택적으로 보기
df = titanic.loc[:, ['age', 'sex', 'class', 'fare', 'survived']]
print('승객 수:', len(df))
# 승객 수: 891
# 그룹짓기
grouped = df.groupby(['class'])
# 출력
for key, group in grouped:
print('* key :', key)
print('*number :', len(group))
print(group.head())
print('\n')
3개의 클래스로 구분되기 때문에 3개의 그룹이 나온다.
* key : First
*number : 216
age sex class fare survived
1 38.0 female First 71.2833 1
3 35.0 female First 53.1000 1
6 54.0 male First 51.8625 0
11 58.0 female First 26.5500 1
23 28.0 male First 35.5000 1
* key : Second
*number : 184
age sex class fare survived
9 14.0 female Second 30.0708 1
15 55.0 female Second 16.0000 1
17 NaN male Second 13.0000 1
20 35.0 male Second 26.0000 0
21 34.0 male Second 13.0000 1
* key : Third
*number : 491
age sex class fare survived
0 22.0 male Third 7.2500 0
2 26.0 female Third 7.9250 1
4 35.0 male Third 8.0500 0
5 NaN male Third 8.4583 0
7 2.0 male Third 21.0750 0
각 그룹별 생존 비율
# 평균 출력
average = grouped.mean()
print(average)
출력결과
age fare survived
class
First 38.233441 84.154687 0.629630
Second 29.877630 20.662183 0.472826
Third 25.140620 13.675550 0.242363
생존자의 비율을 Class, Gender로 알아보기
# 탑승 클래스와 성별을 기준으로 그룹 나누기
grouped_two = df.groupby(['class', 'sex'])
# 출력
for key, group in grouped_two:
print('* key :', key)
print('* number :', len(group))
print(group.head())
print('\n')
3개의 클래스 * 2개의 성별로 구분되기 때문에 6개의 그룹이 나온다.
* key : ('First', 'female')
* number : 94
age sex class fare survived
1 38.0 female First 71.2833 1
3 35.0 female First 53.1000 1
11 58.0 female First 26.5500 1
31 NaN female First 146.5208 1
52 49.0 female First 76.7292 1
* key : ('First', 'male')
* number : 122
age sex class fare survived
6 54.0 male First 51.8625 0
23 28.0 male First 35.5000 1
27 19.0 male First 263.0000 0
30 40.0 male First 27.7208 0
34 28.0 male First 82.1708 0
* key : ('Second', 'female')
* number : 76
age sex class fare survived
9 14.0 female Second 30.0708 1
15 55.0 female Second 16.0000 1
41 27.0 female Second 21.0000 0
43 3.0 female Second 41.5792 1
53 29.0 female Second 26.0000 1
* key : ('Second', 'male')
* number : 108
age sex class fare survived
17 NaN male Second 13.0 1
20 35.0 male Second 26.0 0
21 34.0 male Second 13.0 1
33 66.0 male Second 10.5 0
70 32.0 male Second 10.5 0
* key : ('Third', 'female')
* number : 144
age sex class fare survived
2 26.0 female Third 7.9250 1
8 27.0 female Third 11.1333 1
10 4.0 female Third 16.7000 1
14 14.0 female Third 7.8542 0
18 31.0 female Third 18.0000 0
* key : ('Third', 'male')
* number : 347
age sex class fare survived
0 22.0 male Third 7.2500 0
4 35.0 male Third 8.0500 0
5 NaN male Third 8.4583 0
7 2.0 male Third 21.0750 0
12 20.0 male Third 8.0500 0
각 그룹별 생존 비율
# 각 그룹별로 생존 비율 알아보기
average_two = grouped_two.mean()
print(average_two)
print('\n')
print(type(average_two))
출력결과
age fare survived
class sex
First female 34.611765 106.125798 0.968085
male 41.281386 67.226127 0.368852
Second female 28.722973 21.970121 0.921053
male 30.740707 19.741782 0.157407
Third female 21.750000 16.118810 0.500000
male 26.507589 12.661633 0.135447
<class 'pandas.core.frame.DataFrame'>
삼등칸 여성승객의 생존 비율
# 6개의 그룹으로 나뉘어진 grouped_two에서 삼등칸 여성승객의 정보만 골라서 가져온다.
# group3f 변수에 저장
group3f = grouped_two.get_group(('Third', 'female'))
print(group3.head())
# 출력
age sex class fare survived
0 22.0 male Third 7.2500 0
2 26.0 female Third 7.9250 1
4 35.0 male Third 8.0500 0
5 NaN male Third 8.4583 0
7 2.0 male Third 21.0750 0
탑승 Class 가격 비교
# class의 정보중에 fare만 조회
df.groupby(["class"]).std()
class
First 78.380373
Second 13.417399
Third 11.778142
Name: fare, dtype: float64