Code for the post
2023.06.12 - [Data Analysis] - [23.06.12] Python pandas - 07(1)
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('../data/gapminder.tsv', sep='\t')
df
Out[2]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 |
1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 |
1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 |
1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 |
1704 rows × 6 columns
In [3]:
df.head() # show the first 5 rows
Out[3]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
In [4]:
df.tail() # show the last 5 rows
Out[4]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 |
1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 |
1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 |
1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 |
In [5]:
type(df)
Out[5]:
pandas.core.frame.DataFrame
In [6]:
print(df.shape) # (rows, columns)
print(df.shape[0])
print(df.shape[1])
(1704, 6)
1704
6
In [7]:
print(df.columns)
Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
In [8]:
print(df.dtypes)
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
In [9]:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 continent 1704 non-null object
2 year 1704 non-null int64
3 lifeExp 1704 non-null float64
4 pop 1704 non-null int64
5 gdpPercap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None
Extracting data by column
In [10]:
country_df = df['country']
type(country_df)
Out[10]:
pandas.core.series.Series
In [11]:
country_df
Out[11]:
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
...
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, Length: 1704, dtype: object
In [12]:
print(country_df.tail()) # last 5 values
country_df.head() # first 5 values
1699 Zimbabwe
1700 Zimbabwe
1701 Zimbabwe
1702 Zimbabwe
1703 Zimbabwe
Name: country, dtype: object
Out[12]:
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Name: country, dtype: object
In [13]:
subset = df[['country', 'continent', 'year']]
print(subset)
print(type(subset))
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
... ... ... ...
1699 Zimbabwe Africa 1987
1700 Zimbabwe Africa 1992
1701 Zimbabwe Africa 1997
1702 Zimbabwe Africa 2002
1703 Zimbabwe Africa 2007
[1704 rows x 3 columns]
<class 'pandas.core.frame.DataFrame'>
In [14]:
print(subset.head())
subset.tail()
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
Out[14]:
| | country | continent | year |
---|---|---|---|
1699 | Zimbabwe | Africa | 1987 |
1700 | Zimbabwe | Africa | 1992 |
1701 | Zimbabwe | Africa | 1997 |
1702 | Zimbabwe | Africa | 2002 |
1703 | Zimbabwe | Africa | 2007 |
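The cells In [10] and In [13] show that the bracket style decides the return type: a single label gives a Series, a list of labels gives a DataFrame. A minimal sketch of the same rule, assuming a small made-up frame (`toy` is hypothetical and stands in for the gapminder data):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the gapminder data
toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "continent": ["Asia", "Europe", "Africa"],
    "year": [1952, 1952, 1952],
})

one_col_series = toy["country"]    # single label -> Series
one_col_frame = toy[["country"]]   # list with one label -> DataFrame

print(type(one_col_series).__name__)  # Series
print(type(one_col_frame).__name__)   # DataFrame
print(one_col_frame.shape)            # (3, 1)
```

Even a one-element list keeps the result two-dimensional, which is useful when later code expects a DataFrame.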
3) Extracting rows with the loc attribute
- loc extracts data based on the DataFrame's index labels
- iloc extracts data based on the row number, i.e. the position of the row in the data
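With the default index 0, 1, 2, ... labels and positions coincide, so the difference between loc and iloc is easy to miss. A small sketch, assuming a made-up frame whose index labels are 10, 20, 30 (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
toy = pd.DataFrame(
    {"country": ["Afghanistan", "Albania", "Algeria"]},
    index=[10, 20, 30],
)

by_label = toy.loc[10, "country"]    # label 10 -> first row
by_position = toy.iloc[0, 0]         # position 0 -> first row
last_by_position = toy.iloc[-1, 0]   # position -1 -> last row

print(by_label, by_position, last_by_position)
```

Here `toy.loc[0]` would fail, because 0 is a position in this frame and not one of its labels.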
In [15]:
df.head()
Out[15]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
In [16]:
df.loc[0]
Out[16]:
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445314
Name: 0, dtype: object
In [17]:
df.loc[99] # row labeled 99 (the 100th row)
Out[17]:
country Bangladesh
continent Asia
year 1967
lifeExp 43.453
pop 62821884
gdpPercap 721.186086
Name: 99, dtype: object
In [18]:
# df.loc[-1] # raises a KeyError because -1 is not in the index
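The commented-out `df.loc[-1]` can be tried safely inside a try/except. A minimal sketch, assuming a made-up three-row frame with the default index (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the default index 0, 1, 2
toy = pd.DataFrame({"year": [1952, 1957, 1962]})

try:
    toy.loc[-1]            # -1 is not a label in the index
    loc_failed = False
except KeyError:
    loc_failed = True

last_year = toy.iloc[-1]["year"]   # iloc counts from the end instead

print(loc_failed)   # True
print(last_year)    # 1962
```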
In [19]:
# Get the index of the last row
number_of_rows = df.shape[0] # number of rows
print(number_of_rows)
last_row_index = number_of_rows - 1 # the index starts at 0, so subtract 1
print(df.loc[last_row_index]) # print the last row
1704
country Zimbabwe
continent Africa
year 2007
lifeExp 43.487
pop 12311143
gdpPercap 469.709298
Name: 1703, dtype: object
In [20]:
print(df.tail(n=1))
country continent year lifeExp pop gdpPercap
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
In [21]:
print(df.tail(3))
country continent year lifeExp pop gdpPercap
1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
In [22]:
df.loc[[0,99,999]]
Out[22]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
99 | Bangladesh | Asia | 1967 | 43.453 | 62821884 | 721.186086 |
999 | Mongolia | Asia | 1967 | 51.253 | 1149500 | 1226.041130 |
4) The tail method and the loc attribute return different types
In [23]:
subset_loc = df.loc[0]
subset_tail = df.tail(n=1)
print(subset_loc)
print(subset_tail)
print(type(subset_loc))
print(type(subset_tail))
country Afghanistan
continent Asia
year 1952
lifeExp 28.801
pop 8425333
gdpPercap 779.445314
Name: 0, dtype: object
country continent year lifeExp pop gdpPercap
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
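The type difference comes from how the row is requested, not from loc itself: a scalar label gives a Series, while a list or a slice gives a DataFrame. A short sketch, assuming a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical two-column frame
toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "year": [1952, 1952, 1952],
})

row_series = toy.loc[0]      # scalar label -> Series
row_frame = toy.loc[[0]]     # list of labels -> DataFrame with one row
last_frame = toy.iloc[-1:]   # slice -> DataFrame, like tail(n=1)

print(type(row_series).__name__)   # Series
print(type(row_frame).__name__)    # DataFrame
print(type(last_frame).__name__)   # DataFrame
```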
5) Extracting rows with the iloc attribute
In [24]:
df.head()
Out[24]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
In [25]:
print(df.iloc[1])
country Afghanistan
continent Asia
year 1957
lifeExp 30.332
pop 9240934
gdpPercap 820.85303
Name: 1, dtype: object
In [26]:
print(df.iloc[99])
country Bangladesh
continent Asia
year 1967
lifeExp 43.453
pop 62821884
gdpPercap 721.186086
Name: 99, dtype: object
In [27]:
print(df.iloc[-1]) # iloc also accepts negative positions
# passing -1 extracts the last row
country Zimbabwe
continent Africa
year 2007
lifeExp 43.487
pop 12311143
gdpPercap 469.709298
Name: 1703, dtype: object
In [28]:
# like loc, iloc accepts a list of the row numbers you want
df.iloc[[0,99,999]]
Out[28]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
99 | Bangladesh | Asia | 1967 | 43.453 | 62821884 | 721.186086 |
999 | Mongolia | Asia | 1967 | 51.253 | 1149500 | 1226.041130 |
In [29]:
# loc: pass a list of column names (strings)
subset = df.loc[:, ['year', 'pop']]
subset.head()
Out[29]:
| | year | pop |
---|---|---|
0 | 1952 | 8425333 |
1 | 1957 | 9240934 |
2 | 1962 | 10267083 |
3 | 1967 | 11537966 |
4 | 1972 | 13079460 |
In [30]:
# iloc: pass a list of column positions (integers)
subset = df.iloc[:, [2, 4, -1]]
subset.head()
Out[30]:
| | year | pop | gdpPercap |
---|---|---|---|
0 | 1952 | 8425333 | 779.445314 |
1 | 1957 | 9240934 | 820.853030 |
2 | 1962 | 10267083 | 853.100710 |
3 | 1967 | 11537966 | 836.197138 |
4 | 1972 | 13079460 | 739.981106 |
6-2) Extracting data with the range function
In [31]:
small_range = list(range(5))
print(small_range)
[0, 1, 2, 3, 4]
In [32]:
print(type(small_range))
<class 'list'>
In [33]:
print(df.head())
subset = df.iloc[:, small_range]
print(subset)
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
country continent year lifeExp pop
0 Afghanistan Asia 1952 28.801 8425333
1 Afghanistan Asia 1957 30.332 9240934
2 Afghanistan Asia 1962 31.997 10267083
3 Afghanistan Asia 1967 34.020 11537966
4 Afghanistan Asia 1972 36.088 13079460
... ... ... ... ... ...
1699 Zimbabwe Africa 1987 62.351 9216418
1700 Zimbabwe Africa 1992 60.377 10704340
1701 Zimbabwe Africa 1997 46.809 11404948
1702 Zimbabwe Africa 2002 39.989 11926563
1703 Zimbabwe Africa 2007 43.487 12311143
[1704 rows x 5 columns]
In [34]:
small_range = list(range(3,6))
print(small_range)
[3, 4, 5]
In [35]:
subset = df.iloc[:, small_range]
print(subset.head())
lifeExp pop gdpPercap
0 28.801 8425333 779.445314
1 30.332 9240934 820.853030
2 31.997 10267083 853.100710
3 34.020 11537966 836.197138
4 36.088 13079460 739.981106
6-3) Selecting every second column from 0 to 5 with range
In [36]:
df.head()
Out[36]:
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
In [37]:
small_range = list(range(0,6,2))
print(small_range)
subset = df.iloc[:, small_range]
print(subset.head())
[0, 2, 4]
country year pop
0 Afghanistan 1952 8425333
1 Afghanistan 1957 9240934
2 Afghanistan 1962 10267083
3 Afghanistan 1967 11537966
4 Afghanistan 1972 13079460
6-4) Comparing slicing and range
In [38]:
subset = df.iloc[:, :3] # same as list(range(3))
print(subset.head())
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972
In [39]:
subset = df.iloc[:, 0:6:2] # same as list(range(0,6,2))
print(subset.head())
country year pop
0 Afghanistan 1952 8425333
1 Afghanistan 1957 9240934
2 Afghanistan 1962 10267083
3 Afghanistan 1967 11537966
4 Afghanistan 1972 13079460
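The claim that a slice and the matching `list(range(...))` select the same columns can be checked directly. A minimal sketch, assuming a made-up two-row frame with the same six column names and invented values (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the gapminder column names and made-up values
toy = pd.DataFrame({
    "country": ["A", "B"],
    "continent": ["X", "Y"],
    "year": [1952, 1957],
    "lifeExp": [30.0, 40.0],
    "pop": [100, 200],
    "gdpPercap": [1.5, 2.5],
})

via_slice = toy.iloc[:, 0:6:2]                  # start:stop:step slice
via_range = toy.iloc[:, list(range(0, 6, 2))]   # explicit positions [0, 2, 4]

print(list(via_slice.columns))       # ['country', 'year', 'pop']
print(via_slice.equals(via_range))   # True
```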
6-5) Using the loc and iloc attributes flexibly
In [40]:
# extract the columns at positions 0, 3, 5 from rows 0, 99, 999
print(df.iloc[[0,99,999], [0,3,5]])
country lifeExp gdpPercap
0 Afghanistan 28.801 779.445314
99 Bangladesh 43.453 721.186086
999 Mongolia 51.253 1226.041130
In [41]:
# extract the 'country', 'lifeExp', 'gdpPercap' columns from rows 0, 9, 999
print(df.loc[[0,9,999], ['country', 'lifeExp', 'gdpPercap']])
country lifeExp gdpPercap
0 Afghanistan 28.801 779.445314
9 Afghanistan 41.763 635.341351
999 Mongolia 51.253 1226.041130
In [42]:
# extract the 'country', 'lifeExp', 'gdpPercap' columns from rows 10 through 13
print(df.loc[10:13, ['country', 'lifeExp', 'gdpPercap']])
country lifeExp gdpPercap
10 Afghanistan 42.129 726.734055
11 Afghanistan 43.828 974.580338
12 Albania 55.230 1601.056136
13 Albania 59.280 1942.284244
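Note that `df.loc[10:13]` returned four rows, including row 13. Label slices with loc include the end label, while position slices with iloc exclude the end position, as in ordinary Python slicing. A small sketch, assuming a made-up 20-row frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 20 rows and the default index 0..19
toy = pd.DataFrame({"lifeExp": [float(i) for i in range(20)]})

by_label = toy.loc[10:13]      # labels 10, 11, 12, 13 (end included)
by_position = toy.iloc[10:13]  # positions 10, 11, 12 (end excluded)

print(list(by_label.index))     # [10, 11, 12, 13]
print(list(by_position.index))  # [10, 11, 12]
```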
7) Computing the mean of grouped data
7-1) Grouping the lifeExp column by year and computing the mean
In [43]:
print(df.head(n=10))
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
5 Afghanistan Asia 1977 38.438 14880372 786.113360
6 Afghanistan Asia 1982 39.854 12881816 978.011439
7 Afghanistan Asia 1987 40.822 13867957 852.395945
8 Afghanistan Asia 1992 41.674 16317921 649.341395
9 Afghanistan Asia 1997 41.763 22227415 635.341351
In [44]:
print(df.groupby('year')['lifeExp'].mean()) # mean life expectancy by year
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
In [45]:
grouped_year_df = df.groupby('year')
print(type(grouped_year_df))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
In [46]:
# groupby returns a DataFrameGroupBy object, not a DataFrame; printing it shows only its type and memory address (0x...)
print(grouped_year_df)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027343CA5990>
In [47]:
# selecting the lifeExp column from the grouped object gives a SeriesGroupBy object
grouped_year_df_lifeExp = grouped_year_df['lifeExp']
print(grouped_year_df_lifeExp)
<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000027343CA4B50>
In [48]:
mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
print(mean_lifeExp_by_year)
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
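The three steps (group, select a column, aggregate) amount to filtering the rows of each year and averaging them separately. A minimal sketch that compares groupby with that manual loop, assuming a made-up four-row frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data: two years, two rows each
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})

# groupby version, as in the cells above
grouped = toy.groupby("year")["lifeExp"].mean()

# manual version: filter the rows of each year, then average
manual = {}
for year in toy["year"].unique():
    rows_for_year = toy.loc[toy["year"] == year, "lifeExp"]
    manual[int(year)] = float(rows_for_year.mean())

print(grouped.to_dict())  # {1952: 40.0, 1957: 50.0}
print(manual)             # {1952: 40.0, 1957: 50.0}
```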
7-2) Computing the mean of the lifeExp and gdpPercap columns grouped by year and continent
In [49]:
multi_group_var = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
In [50]:
multi_group_var
Out[50]:
year | continent | lifeExp | gdpPercap |
---|---|---|---|
1952 | Africa | 39.135500 | 1252.572466 |
1952 | Americas | 53.279840 | 4079.062552 |
1952 | Asia | 46.314394 | 5195.484004 |
1952 | Europe | 64.408500 | 5661.057435 |
1952 | Oceania | 69.255000 | 10298.085650 |
1957 | Africa | 41.266346 | 1385.236062 |
1957 | Americas | 55.960280 | 4616.043733 |
1957 | Asia | 49.318544 | 5787.732940 |
1957 | Europe | 66.703067 | 6963.012816 |
1957 | Oceania | 70.295000 | 11598.522455 |
1962 | Africa | 43.319442 | 1598.078825 |
1962 | Americas | 58.398760 | 4901.541870 |
1962 | Asia | 51.563223 | 5729.369625 |
1962 | Europe | 68.539233 | 8365.486814 |
1962 | Oceania | 71.085000 | 12696.452430 |
1967 | Africa | 45.334538 | 2050.363801 |
1967 | Americas | 60.410920 | 5668.253496 |
1967 | Asia | 54.663640 | 5971.173374 |
1967 | Europe | 69.737600 | 10143.823757 |
1967 | Oceania | 71.310000 | 14495.021790 |
1972 | Africa | 47.450942 | 2339.615674 |
1972 | Americas | 62.394920 | 6491.334139 |
1972 | Asia | 57.319269 | 8187.468699 |
1972 | Europe | 70.775033 | 12479.575246 |
1972 | Oceania | 71.910000 | 16417.333380 |
1977 | Africa | 49.580423 | 2585.938508 |
1977 | Americas | 64.391560 | 7352.007126 |
1977 | Asia | 59.610556 | 7791.314020 |
1977 | Europe | 71.937767 | 14283.979110 |
1977 | Oceania | 72.855000 | 17283.957605 |
1982 | Africa | 51.592865 | 2481.592960 |
1982 | Americas | 66.228840 | 7506.737088 |
1982 | Asia | 62.617939 | 7434.135157 |
1982 | Europe | 72.806400 | 15617.896551 |
1982 | Oceania | 74.290000 | 18554.709840 |
1987 | Africa | 53.344788 | 2282.668991 |
1987 | Americas | 68.090720 | 7793.400261 |
1987 | Asia | 64.851182 | 7608.226508 |
1987 | Europe | 73.642167 | 17214.310727 |
1987 | Oceania | 75.320000 | 20448.040160 |
1992 | Africa | 53.629577 | 2281.810333 |
1992 | Americas | 69.568360 | 8044.934406 |
1992 | Asia | 66.537212 | 8639.690248 |
1992 | Europe | 74.440100 | 17061.568084 |
1992 | Oceania | 76.945000 | 20894.045885 |
1997 | Africa | 53.598269 | 2378.759555 |
1997 | Americas | 71.150480 | 8889.300863 |
1997 | Asia | 68.020515 | 9834.093295 |
1997 | Europe | 75.505167 | 19076.781802 |
1997 | Oceania | 78.190000 | 24024.175170 |
2002 | Africa | 53.325231 | 2599.385159 |
2002 | Americas | 72.422040 | 9287.677107 |
2002 | Asia | 69.233879 | 10174.090397 |
2002 | Europe | 76.700600 | 21711.732422 |
2002 | Oceania | 79.740000 | 26938.778040 |
2007 | Africa | 54.806038 | 3089.032605 |
2007 | Americas | 73.608120 | 11003.031625 |
2007 | Asia | 70.728485 | 12473.026870 |
2007 | Europe | 77.648600 | 25054.481636 |
2007 | Oceania | 80.719500 | 29810.188275 |
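Grouping by two columns produces a result indexed by (year, continent) pairs. A minimal sketch of two common ways to work with such a result, assuming a made-up six-row frame (`toy` and its values are hypothetical): look up one group with a tuple, or flatten the index back into ordinary columns with `reset_index`.

```python
import pandas as pd

# Hypothetical data with two grouping columns
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Europe", "Europe"],
    "lifeExp": [30.0, 40.0, 60.0, 45.0, 65.0, 67.0],
})

means = toy.groupby(["year", "continent"])[["lifeExp"]].mean()

# look up a single group with a (year, continent) tuple
asia_1952 = means.loc[(1952, "Asia"), "lifeExp"]

# turn the two index levels back into regular columns
flat = means.reset_index()

print(asia_1952)            # 35.0
print(list(flat.columns))   # ['year', 'continent', 'lifeExp']
```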
7-3) Counting grouped data
In [51]:
print(df.groupby('continent')['country'].nunique()) # number of unique countries per continent
continent
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2
Name: country, dtype: int64
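`nunique` counts distinct values, so a country that appears in several rows is counted once. Counting rows instead needs `count` (or `size`). A small sketch of the difference, assuming a made-up five-row frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data: Afghanistan and Albania each appear twice
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "country": ["Afghanistan", "Afghanistan", "Bahrain", "Albania", "Albania"],
})

unique_countries = toy.groupby("continent")["country"].nunique()  # distinct values
row_counts = toy.groupby("continent")["country"].count()          # non-null rows

print(unique_countries.to_dict())  # {'Asia': 2, 'Europe': 1}
print(row_counts.to_dict())        # {'Asia': 3, 'Europe': 2}
```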
7-4) Plotting
In [72]:
%matplotlib
Using matplotlib backend: QtAgg
In [73]:
import matplotlib.pyplot as plt
aa = df.groupby('year')['lifeExp'].mean()
print(aa)
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
In [74]:
aa.plot()
Out[74]:
<Axes: xlabel='year'>
In [79]:
multi_group_var.plot()
Out[79]:
<Axes: xlabel='year,continent'>
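Plotting the two-level result directly puts every (year, continent) pair on the x-axis, which is hard to read. One possible alternative, not part of the original notebook, is to reshape with `unstack` so that years form the index and each continent becomes a column. A sketch assuming a made-up six-row frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with two grouping columns
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Europe", "Europe"],
    "lifeExp": [30.0, 40.0, 60.0, 45.0, 65.0, 67.0],
})

means = toy.groupby(["year", "continent"])["lifeExp"].mean()

# move the continent level from the index into the columns
wide = means.unstack("continent")

print(list(wide.index))     # [1952, 1957]
print(list(wide.columns))   # ['Asia', 'Europe']
```

Calling `wide.plot()` on a frame shaped like this draws one line per column, with the year on the x-axis.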