Using the apply method

The apply method runs a user-defined function over every column or every row of a DataFrame in a single call.
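
As a minimal sketch of what that means, the snippet below applies one function first to each column and then to each row. The toy data and the names `toy` and `value_range` are hypothetical and not part of the Titanic example that follows.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(vec):
    # vec is a Series: one column (default axis=0) or one row (axis=1)
    return vec.max() - vec.min()

col_result = toy.apply(value_range)          # one value per column
row_result = toy.apply(value_range, axis=1)  # one value per row
```

With the default `axis=0` the function receives each column in turn; with `axis=1` it receives each row.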

Handling missing values in a DataFrame with the apply method

1. Handling missing values in a DataFrame - column-wise

In [1]:

import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()

Out[1]:

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

In [2]:

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

In [3]:

# count_missing: count the number of missing values in a vector
import numpy as np
import pandas as pd
def count_missing(vec):
    null_vec = pd.isnull(vec)
#     print(null_vec)
    null_count = np.sum(null_vec)
    return null_count

In [4]:

cmis_col = titanic.apply(count_missing)
cmis_col

Out[4]:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
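
The same per-column counts can probably be obtained without a custom function, because pandas has `isna()` and `sum()` built in. The sketch below compares the two approaches on a small hypothetical frame (`toy`), so that it runs without downloading the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two of the Titanic columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
})

def count_missing(vec):
    # Same idea as the function above: count the True values in the null mask
    return pd.isnull(vec).sum()

via_apply = toy.apply(count_missing)  # custom function, column by column
via_builtin = toy.isna().sum()        # vectorized equivalent
```

Both results are Series indexed by column name. The built-in form is usually shorter and faster on large frames.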

In [5]:

# prop_missing: compute the proportion of missing values
# Use count_missing to get the number of missing values,
# then divide by the total number of elements in the vector (its size attribute)

def prop_missing(vec):
    num = count_missing(vec)
    dem = vec.size
#     print(dem)
#     print('num', num)
    return num/dem

In [6]:

pmis_col = titanic.apply(prop_missing)
pmis_col

Out[6]:

survived       0.000000
pclass         0.000000
sex            0.000000
age            0.198653
sibsp          0.000000
parch          0.000000
fare           0.000000
embarked       0.002245
class          0.000000
who            0.000000
adult_male     0.000000
deck           0.772166
embark_town    0.002245
alive          0.000000
alone          0.000000
dtype: float64
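
Since the mean of a boolean mask is the fraction of `True` values, the proportion of missing values can also be sketched with `isna().mean()`. The `toy` frame below is hypothetical illustration data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 of 4 ages and 3 of 4 decks are missing
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
})

prop_missing_builtin = toy.isna().mean()       # fraction of NaN per column
prop_complete_builtin = toy.notna().mean()     # fraction of non-NaN per column
```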

In [7]:

# prop_complete: compute the proportion of non-missing values
# whole (1) - proportion of missing values
def prop_complete(vec):
    return 1 - prop_missing(vec)

2. Handling missing values in a DataFrame - row-wise

In [8]:

cmis_row = titanic.apply(count_missing, axis=1)
pmis_row = titanic.apply(prop_missing, axis=1)
pcom_row = titanic.apply(prop_complete, axis=1)
cmis_row

Out[8]:

0      1
1      0
2      1
3      0
4      1
      ..
886    1
887    0
888    2
889    0
890    1
Length: 891, dtype: int64
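
For the row direction, a vectorized sketch would pass `axis=1` to `sum()` on the null mask. The `toy` frame is again hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with 1, 1 and 2 missing values per row
toy = pd.DataFrame({
    "age": [22.0, np.nan, np.nan],
    "deck": [np.nan, "C", np.nan],
})

# axis=1 sums across the columns, giving one count per row
missing_per_row = toy.isna().sum(axis=1)
```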

In [9]:

titanic['num_missing'] = titanic.apply(count_missing, axis=1)
titanic.head()

Out[9]:

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone	num_missing
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False	1
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False	0
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True	1
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False	0
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True	1

In [10]:

# Collect only the rows that contain missing values
# Sample 10 rows that have 2 or more missing values
titanic.loc[titanic.num_missing > 1, :].sample(10)

Out[10]:

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone	num_missing
240	0	3	female	NaN	1	0	14.4542	C	Third	woman	False	NaN	Cherbourg	no	False	2
468	0	3	male	NaN	0	0	7.7250	Q	Third	man	True	NaN	Queenstown	no	True	2
431	1	3	female	NaN	1	0	16.1000	S	Third	woman	False	NaN	Southampton	yes	False	2
264	0	3	female	NaN	0	0	7.7500	Q	Third	woman	False	NaN	Queenstown	no	True	2
863	0	3	female	NaN	8	2	69.5500	S	Third	woman	False	NaN	Southampton	no	False	2
277	0	2	male	NaN	0	0	0.0000	S	Second	man	True	NaN	Southampton	no	True	2
589	0	3	male	NaN	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True	2
768	0	3	male	NaN	1	0	24.1500	Q	Third	man	True	NaN	Queenstown	no	False	2
656	0	3	male	NaN	0	0	7.8958	S	Third	man	True	NaN	Southampton	no	True	2
375	1	1	female	NaN	1	0	82.1708	C	First	woman	False	NaN	Cherbourg	yes	False	2
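
Because `sample()` picks rows at random, the rows shown above will differ from run to run. A small sketch of the same filter-then-sample pattern, with `random_state` fixed so the pick is repeatable, might look like this (the `toy` data and the seed value `0` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data; rows 2 and 4 have two missing values each
toy = pd.DataFrame({
    "age": [22.0, np.nan, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, "E", np.nan],
})
toy["num_missing"] = toy.isna().sum(axis=1)

# Boolean mask inside loc keeps only rows with more than one missing value
many_missing = toy.loc[toy["num_missing"] > 1, :]

# random_state makes the random sample reproducible
picked = many_missing.sample(n=2, random_state=0)
```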

Exercises

In [11]:

student = [{'name': 'A', 'birth': '1999-06-27', 'mid': 95, 'fin': 85},
           {'name': 'B', 'birth': '1997-06-27', 'mid': 85, 'fin': 80},
           {'name': 'C', 'birth': '1998-06-27', 'mid': 10, 'fin': 30},
           {'name': 'D', 'birth': '2000-06-27', 'mid': 73, 'fin': 90}]
df = pd.DataFrame(student, columns=['name', 'birth', 'mid', 'fin'])

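1. Create a tot column from the two existing columns (mid, fin).
2. Create an avg column from the existing columns (mid, fin).
3. Create a grade column: using if~else, assign 'A' if avg is 90 or above, 'B' if 80 or above, 'C' if 70 or above, and 'F' otherwise.
4. Write a function that returns '합격' (pass) for A, B, or C and '불합격' (fail) otherwise, and use apply to store the result in a grade1 column.
5. Write a function that returns only the birth year, and use the apply method to extract the year from the year-month-day string into a year column.
6. Write a function that computes age, and use the apply method to derive the age from the birth year and store it in an age column.
7. Print only the people aged 22 or older.
8. Print only the people aged 22 or older whose name is 'A'.
9. Print the columns from birth through grade.
10. Extract only the name and grade columns.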

In [12]:

df

Out[12]:

	name	birth	mid	fin
0	A	1999-06-27	95	85
1	B	1997-06-27	85	80
2	C	1998-06-27	10	30
3	D	2000-06-27	73	90

In [13]:

# Exercise 1

df['tot'] = df['mid']+ df['fin']
df

Out[13]:

	name	birth	mid	fin	tot
0	A	1999-06-27	95	85	180
1	B	1997-06-27	85	80	165
2	C	1998-06-27	10	30	40
3	D	2000-06-27	73	90	163

In [14]:

# Exercise 2

df['avg'] = (df['mid'] + df['fin'])/2
df

Out[14]:

	name	birth	mid	fin	tot	avg
0	A	1999-06-27	95	85	180	90.0
1	B	1997-06-27	85	80	165	82.5
2	C	1998-06-27	10	30	40	20.0
3	D	2000-06-27	73	90	163	81.5
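
When there are more than two score columns, writing the sum out by hand gets tedious. One possible alternative, sketched here on a hypothetical `scores` frame with the same numbers, is to select the columns and aggregate along `axis=1`.

```python
import pandas as pd

# Hypothetical copy of the mid/fin scores used in the exercise
scores = pd.DataFrame({"mid": [95, 85, 10, 73], "fin": [85, 80, 30, 90]})

# Row-wise sum and mean over the selected score columns
scores["tot"] = scores[["mid", "fin"]].sum(axis=1)
scores["avg"] = scores[["mid", "fin"]].mean(axis=1)
```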

In [15]:

# Exercise 3. Create a grade column: using if~else, assign 'A' if avg is 90 or above,
# 'B' if 80 or above, 'C' if 70 or above, and 'F' otherwise.

def grade(avg):
    if avg >= 90:
        return 'A'
    elif avg >= 80:
        return 'B'
    elif avg >= 70:
        return 'C'
    else:
        return 'F'

df['grade'] = df['avg'].apply(grade)
df

Out[15]:

	name	birth	mid	fin	tot	avg	grade
0	A	1999-06-27	95	85	180	90.0	A
1	B	1997-06-27	85	80	165	82.5	B
2	C	1998-06-27	10	30	40	20.0	F
3	D	2000-06-27	73	90	163	81.5	B
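
The thresholds from the exercise statement (90, 80, 70) can also be expressed as bins. The sketch below uses `pd.cut` with `right=False` so that each bin includes its lower edge; the extra average of 75.0 is a hypothetical value added to exercise the 'C' branch, which none of the four students reach.

```python
import numpy as np
import pandas as pd

# The four averages from the exercise plus one hypothetical value (75.0)
avg = pd.Series([90.0, 82.5, 20.0, 81.5, 75.0])

# Bins are [-inf, 70), [70, 80), [80, 90), [90, inf) because right=False
bins = [-np.inf, 70, 80, 90, np.inf]
labels = ["F", "C", "B", "A"]

grade = pd.cut(avg, bins=bins, labels=labels, right=False).astype(str)
```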

In [16]:

# Exercise 4. Write a function that returns '합격' (pass) for A, B, or C and '불합격' (fail) otherwise,
# and use apply to store the result in a grade1 column.

def prob4(grade):
    if grade == 'F':
        return '불합격'
    else: 
        return '합격'
df['grade1'] = df['grade'].apply(prob4)
df

Out[16]:

	name	birth	mid	fin	tot	avg	grade	grade1
0	A	1999-06-27	95	85	180	90.0	A	합격
1	B	1997-06-27	85	80	165	82.5	B	합격
2	C	1998-06-27	10	30	40	20.0	F	불합격
3	D	2000-06-27	73	90	163	81.5	B	합격
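
A version that checks for the passing grades explicitly, rather than treating everything except 'F' as a pass, could be sketched with `isin` and `np.where`. The `grades` Series is a hypothetical copy of the grade column.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the grade column
grades = pd.Series(["A", "B", "F", "B"])

# True where the grade is one of the passing letters
passed = grades.isin(["A", "B", "C"])

# '합격' means pass, '불합격' means fail
result = pd.Series(np.where(passed, "합격", "불합격"), index=grades.index)
```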

In [17]:

# Exercise 5. Write a function that returns only the birth year, and use the apply method
# to extract the year from the year-month-day string into a year column.

def birthday(birth):
    year = birth.split('-')[0]
    return int(year)

df['year'] = df['birth'].apply(birthday)
df

Out[17]:

	name	birth	mid	fin	tot	avg	grade	grade1	year
0	A	1999-06-27	95	85	180	90.0	A	합격	1999
1	B	1997-06-27	85	80	165	82.5	B	합격	1997
2	C	1998-06-27	10	30	40	20.0	F	불합격	1998
3	D	2000-06-27	73	90	163	81.5	B	합격	2000
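
Splitting the string works as long as every value has the form YYYY-MM-DD. An alternative sketch, assuming the strings are valid dates, parses them with `pd.to_datetime` and reads the year through the `.dt` accessor.

```python
import pandas as pd

# Hypothetical copy of the birth column
births = pd.Series(["1999-06-27", "1997-06-27", "1998-06-27", "2000-06-27"])

# Parse the strings as dates, then take the year component
years = pd.to_datetime(births).dt.year
```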

In [18]:

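# Exercise 6. Write a function that computes age, and use the apply method
# to derive the age from the birth year and store it in an age column.

def cal_age(year):
    # 2023 is the reference year (the year this post was written)
    age = 2023 - year
    return age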

df['age'] = df['year'].apply(cal_age)
df

Out[18]:

	name	birth	mid	fin	tot	avg	grade	grade1	year	age
0	A	1999-06-27	95	85	180	90.0	A	합격	1999	24
1	B	1997-06-27	85	80	165	82.5	B	합격	1997	26
2	C	1998-06-27	10	30	40	20.0	F	불합격	1998	25
3	D	2000-06-27	73	90	163	81.5	B	합격	2000	23
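
To avoid hard-coding the reference year inside the function, one option is to make it a parameter and pass it through `apply`. In this sketch the parameter name `ref_year` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical copy of the year column
years = pd.Series([1999, 1997, 1998, 2000])

def cal_age(year, ref_year):
    # ref_year is a hypothetical extra parameter: the year to measure age against
    return ref_year - year

ages_args = years.apply(cal_age, args=(2023,))     # extra positional argument
ages_kwargs = years.apply(cal_age, ref_year=2023)  # extra keyword argument
ages_vectorized = 2023 - years                     # no apply needed for simple arithmetic
```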

In [19]:

# Exercise 7. Print only the people aged 22 or older
df[df['age']>=22]

Out[19]:

	name	birth	mid	fin	tot	avg	grade	grade1	year	age
0	A	1999-06-27	95	85	180	90.0	A	합격	1999	24
1	B	1997-06-27	85	80	165	82.5	B	합격	1997	26
2	C	1998-06-27	10	30	40	20.0	F	불합격	1998	25
3	D	2000-06-27	73	90	163	81.5	B	합격	2000	23

In [20]:

# Exercise 8. Print only the people aged 22 or older whose name is 'A'
df[(df['age']>=22) & (df['name']=='A')]

Out[20]:

	name	birth	mid	fin	tot	avg	grade	grade1	year	age
0	A	1999-06-27	95	85	180	90.0	A	합격	1999	24
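
The parentheses around each condition matter, because `&` binds more tightly than `>=` and `==` in Python. Naming the masks first, as in this sketch on a hypothetical `people` frame, sidesteps the precedence question and makes the conditions reusable.

```python
import pandas as pd

# Hypothetical copy of the name and age columns
people = pd.DataFrame({"name": ["A", "B", "C", "D"], "age": [24, 26, 25, 23]})

is_22_or_older = people["age"] >= 22
is_named_a = people["name"] == "A"

both = people[is_22_or_older & is_named_a]    # element-wise AND
either = people[is_22_or_older | is_named_a]  # element-wise OR
```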

In [21]:

# Exercise 9. Print the columns from birth through grade
df.loc[:, 'birth':'grade']

Out[21]:

	birth	mid	fin	tot	avg	grade
0	1999-06-27	95	85	180	90.0	A
1	1997-06-27	85	80	165	82.5	B
2	1998-06-27	10	30	40	20.0	F
3	2000-06-27	73	90	163	81.5	B
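
Note that label slices with `loc` include the end label, whereas positional slices with `iloc` exclude the end position. A small sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame with the original four columns
df_small = pd.DataFrame(
    {"name": ["A"], "birth": ["1999-06-27"], "mid": [95], "fin": [85]}
)

by_label = df_small.loc[:, "birth":"fin"]  # includes the 'fin' column
by_position = df_small.iloc[:, 1:3]        # positions 1 and 2 only
```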

In [22]:

# Exercise 10. Extract only the name and grade columns
df[['name','grade']]

Out[22]:

	name	grade
0	A	A
1	B	B
2	C	F
3	D	B
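
The double brackets are what make the result a DataFrame: the inner list names the columns to keep. A quick sketch of the difference, using a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame
df_small = pd.DataFrame({"name": ["A", "B"], "grade": ["A", "B"]})

as_series = df_small["name"]             # single brackets: a Series
as_frame = df_small[["name"]]            # list of one column: a DataFrame
two_cols = df_small[["name", "grade"]]   # list of two columns: a DataFrame
```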


Woogi

[23.06.16] Python apply, exercises - 11(4)
