Using Series and Boolean Extraction
In [1]:
import pandas as pd
In [5]:
scientists = pd.read_csv('../../data/scientists.csv')
scientists.head()
Out[5]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
In [8]:
ages = scientists['Age']
print(ages)
print(ages.max())
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
90
In [18]:
print(ages.mean())
59.125
In [17]:
# Extract only the people who are older than the mean age
ages[ages > ages.mean()]
Out[17]:
1 61
2 90
3 66
7 77
Name: Age, dtype: int64
In [19]:
# The rows at index 1, 2, 3, and 7 are True
ages > ages.mean()
Out[19]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: Age, dtype: bool
In [20]:
# Boolean extraction
# Passing a list of True/False values to a Series extracts only the data at the True positions
manual_bool_values = [True, True, False, False, True, True, False, True]
print(ages[manual_bool_values])
0 37
1 61
4 56
5 45
7 77
Name: Age, dtype: int64
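The boolean extraction above can be reproduced without the CSV file. The following is a minimal sketch that assumes the eight ages printed earlier; the names `mask`, `older`, and `middle` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Ages copied from the printed output above (assumed to match the CSV).
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# A comparison returns a boolean Series with the same index as `ages`.
mask = ages > ages.mean()

# Indexing with the mask keeps only the rows where the mask is True.
older = ages[mask]

# Masks can be combined with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses.
middle = ages[(ages >= 45) & (ages <= 66)]

print(older.tolist())   # [61, 90, 66, 77]
print(middle.tolist())  # [61, 66, 56, 45]
```

Building the mask from a comparison is usually less error-prone than typing a list of True/False values by hand, since the mask always has the same length as the Series.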
In [24]:
# Sort by age in descending order
scientists.sort_values(by='Age', ascending=False).head()
Out[24]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
In [25]:
# Sort by age in ascending order
scientists.sort_values(by='Age', ascending=True).head()
Out[25]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
6 | Alan Turing | 1912-06-23 | 1954-06-07 | 41 | Computer Scientist |
5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
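As a small extension of the sorting shown above, `sort_values` also accepts a list of columns with one direction per column. This sketch uses a three-row frame made up from rows of the table; `by_job_then_age` is an illustrative name.

```python
import pandas as pd

# A small frame built from rows shown in the tables above.
df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "Marie Curie", "Alan Turing"],
    "Age": [37, 66, 41],
    "Occupation": ["Chemist", "Chemist", "Computer Scientist"],
})

# Sort by occupation (A to Z), then by age (oldest first) within each occupation.
by_job_then_age = df.sort_values(by=["Occupation", "Age"], ascending=[True, False])

print(by_job_then_age["Name"].tolist())
# ['Marie Curie', 'Rosaline Franklin', 'Alan Turing']

# sort_values returns a new DataFrame; the original row order is unchanged.
print(df["Name"].tolist())
# ['Rosaline Franklin', 'Marie Curie', 'Alan Turing']
```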
Broadcasting with Vectors and Scalars
- Vector: data that holds multiple values, such as a Series
- Scalar: data that holds a single value (a plain magnitude)
In [30]:
# Adding or multiplying vectors of the same length returns a vector of that same length
ages + ages
Out[30]:
0 74
1 122
2 180
3 132
4 112
5 90
6 82
7 154
Name: Age, dtype: int64
In [31]:
ages * ages
Out[31]:
0 1369
1 3721
2 8100
3 4356
4 3136
5 2025
6 1681
7 5929
Name: Age, dtype: int64
In [34]:
# A scalar operation is applied to every value in the vector: the result of broadcasting
ages + 100
Out[34]:
0 137
1 161
2 190
3 166
4 156
5 145
6 141
7 177
Name: Age, dtype: int64
In [35]:
# Operating on vectors of different lengths
pd.Series([1, 100])
Out[35]:
0 1
1 100
dtype: int64
In [37]:
# Series-to-Series operations only compute values for matching index labels (here 0 and 1); the rest become NaN
ages + pd.Series([1, 100])
Out[37]:
0 38.0
1 161.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
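When the NaN values produced by mismatched indexes are not wanted, one option is `Series.add` with `fill_value`. This is a sketch of that alternative, not something the original notebook does; `short`, `plain`, and `filled` are illustrative names.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
short = pd.Series([1, 100])

# The + operator aligns on index labels; labels missing from `short` give NaN.
plain = ages + short

# Series.add with fill_value=0 treats a missing label as 0 before adding.
filled = ages.add(short, fill_value=0)

print(plain.isna().sum())  # 6
print(filled.tolist())     # [38.0, 161.0, 90.0, 66.0, 56.0, 45.0, 41.0, 77.0]
```

The result is still a float Series, because alignment introduces missing values before they are filled.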
In [40]:
# Sort the index in reverse order
rev_ages = ages.sort_index(ascending=False)
rev_ages
Out[40]:
7 77
6 41
5 45
4 56
3 66
2 90
1 61
0 37
Name: Age, dtype: int64
In [42]:
# Values are added by matching index labels (0 to 7 in both ages and rev_ages), not by position
ages + rev_ages
Out[42]:
0 74
1 122
2 180
3 132
4 112
5 90
6 82
7 154
Name: Age, dtype: int64
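To make the difference between label-based and position-based arithmetic concrete, the sketch below adds the reversed Series twice: once as a Series (aligned by label) and once as a NumPy array (added by position). The names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
rev_ages = ages.sort_index(ascending=False)

# Label-based: index 0 is added to index 0, so the result matches ages + ages.
by_label = ages + rev_ages

# Position-based: converting one side to a NumPy array drops its index,
# so the first value is added to the last, the second to the second-to-last, etc.
by_position = ages + rev_ages.to_numpy()

print(by_label.sort_index().tolist())  # [74, 122, 180, 132, 112, 90, 82, 154]
print(by_position.tolist())            # [114, 102, 135, 122, 122, 135, 102, 114]
```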
In [46]:
scientists[scientists['Age'] > scientists['Age'].mean()]
Out[46]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |
In [50]:
# The rows at index 2 and 6 have a bool value of False, so they are not shown
scientists.loc[[True,True,False,True,True,True,False,True]]
Out[50]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician |
7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |
In [52]:
# DataFrame broadcasting: integer columns are multiplied by 2, and string columns are repeated twice
scientists * 2
Out[52]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
0 | Rosaline FranklinRosaline Franklin | 1920-07-251920-07-25 | 1958-04-161958-04-16 | 74 | ChemistChemist |
1 | William GossetWilliam Gosset | 1876-06-131876-06-13 | 1937-10-161937-10-16 | 122 | StatisticianStatistician |
2 | Florence NightingaleFlorence Nightingale | 1820-05-121820-05-12 | 1910-08-131910-08-13 | 180 | NurseNurse |
3 | Marie CurieMarie Curie | 1867-11-071867-11-07 | 1934-07-041934-07-04 | 132 | ChemistChemist |
4 | Rachel CarsonRachel Carson | 1907-05-271907-05-27 | 1964-04-141964-04-14 | 112 | BiologistBiologist |
5 | John SnowJohn Snow | 1813-03-151813-03-15 | 1858-06-161858-06-16 | 90 | PhysicianPhysician |
6 | Alan TuringAlan Turing | 1912-06-231912-06-23 | 1954-06-071954-06-07 | 82 | Computer ScientistComputer Scientist |
7 | Johann GaussJohann Gauss | 1777-04-301777-04-30 | 1855-02-231855-02-23 | 154 | MathematicianMathematician |
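Doubling the string columns is rarely what is wanted. One possible approach, sketched here on a two-row frame, is to restrict the multiplication to numeric columns with `select_dtypes`; `numeric_cols` and `doubled` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset"],
    "Age": [37, 61],
})

# Pick out only the numeric column labels.
numeric_cols = df.select_dtypes(include="number").columns

# Work on a copy so the original frame stays unchanged.
doubled = df.copy()
doubled[numeric_cols] = df[numeric_cols] * 2

print(doubled["Age"].tolist())   # [74, 122]
print(doubled["Name"].tolist())  # ['Rosaline Franklin', 'William Gosset']
```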
Processing Data in Series and DataFrames
In [58]:
# The dates are stored as strings
print(scientists['Born'].dtype)
print(scientists['Died'].dtype)
scientists['Born']
object
object
Out[58]:
0 1920-07-25
1 1876-06-13
2 1820-05-12
3 1867-11-07
4 1907-05-27
5 1813-03-15
6 1912-06-23
7 1777-04-30
Name: Born, dtype: object
In [66]:
# Convert to the datetime type so that time-related operations are possible
born_datatime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
print(born_datatime)
type(born_datatime)
0 1920-07-25
1 1876-06-13
2 1820-05-12
3 1867-11-07
4 1907-05-27
5 1813-03-15
6 1912-06-23
7 1777-04-30
Name: Born, dtype: datetime64[ns]
Out[66]:
pandas.core.series.Series
In [67]:
died_datatime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
print(died_datatime)
0 1958-04-16
1 1937-10-16
2 1910-08-13
3 1934-07-04
4 1964-04-14
5 1858-06-16
6 1954-06-07
7 1855-02-23
Name: Died, dtype: datetime64[ns]
In [68]:
scientists.head(n=3)
Out[68]:
 | Name | Born | Died | Age | Occupation |
---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
In [80]:
scientists['born_dt'], scientists['died_dt'] = [born_datatime, died_datatime]
In [81]:
scientists.head()
Out[81]:
 | Name | Born | Died | Age | Occupation | born_dt | died_dt |
---|---|---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist | 1920-07-25 | 1958-04-16 |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician | 1876-06-13 | 1937-10-16 |
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse | 1820-05-12 | 1910-08-13 |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist | 1867-11-07 | 1934-07-04 |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist | 1907-05-27 | 1964-04-14 |
In [84]:
type([born_datatime, died_datatime])
Out[84]:
list
In [87]:
scientists.shape
Out[87]:
(8, 7)
In [89]:
scientists['age_daty_dt'] = scientists['died_dt'] - scientists['born_dt']
scientists
Out[89]:
 | Name | Born | Died | Age | Occupation | born_dt | died_dt | age_daty_dt |
---|---|---|---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist | 1920-07-25 | 1958-04-16 | 13779 days |
1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician | 1876-06-13 | 1937-10-16 | 22404 days |
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse | 1820-05-12 | 1910-08-13 | 32964 days |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist | 1867-11-07 | 1934-07-04 | 24345 days |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist | 1907-05-27 | 1964-04-14 | 20777 days |
5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician | 1813-03-15 | 1858-06-16 | 16529 days |
6 | Alan Turing | 1912-06-23 | 1954-06-07 | 41 | Computer Scientist | 1912-06-23 | 1954-06-07 | 15324 days |
7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician | 1777-04-30 | 1855-02-23 | 28422 days |
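The subtraction above yields a timedelta column measured in days. As a rough sketch, the day count can be turned back into whole years with the `.dt` accessor; dividing by 365.25 is an approximation chosen for this example, and the names `lifespan`, `days`, and `years` are illustrative.

```python
import pandas as pd

# Two rows copied from the table above.
df = pd.DataFrame({
    "Born": ["1920-07-25", "1876-06-13"],
    "Died": ["1958-04-16", "1937-10-16"],
})

born = pd.to_datetime(df["Born"], format="%Y-%m-%d")
died = pd.to_datetime(df["Died"], format="%Y-%m-%d")

lifespan = died - born               # timedelta Series
days = lifespan.dt.days              # integer number of days
years = (days / 365.25).astype(int)  # approximate whole years (truncated)

print(days.tolist())   # [13779, 22404]
print(years.tolist())  # [37, 61]
```

The resulting years agree with the `Age` column for these two rows.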
Shuffling Data in Series and DataFrames
In [90]:
scientists['Age']
Out[90]:
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
In [91]:
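# seed sets the starting value for the random numbers the computer generates, so the shuffle is reproducible
# Note: shuffling a DataFrame column in place triggers the SettingWithCopyWarning shown below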
import random
random.seed(42)
random.shuffle(scientists['Age'])
scientists['Age']
C:\ProgramData\anaconda3\lib\random.py:394: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
x[i], x[j] = x[j], x[i]
Out[91]:
0 66
1 56
2 41
3 77
4 90
5 45
6 37
7 61
Name: Age, dtype: int64
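A way to shuffle without the warning is `sample(frac=1)`, which returns a new shuffled object instead of modifying the column in place. This is a sketch of that alternative rather than the notebook's own method, and the resulting order will differ from the `random.shuffle` output above.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# frac=1 draws every row exactly once in random order;
# random_state fixes the seed so the result is reproducible.
shuffled = ages.sample(frac=1, random_state=42)

# The original index labels travel with the values.
# reset_index(drop=True) renumbers the shuffled values 0..n-1.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled)
print(shuffled_reset)
```

Because `sample` leaves `ages` untouched, the original order stays available for later steps.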
Dropping DataFrame Columns
In [92]:
scientists.columns
Out[92]:
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
'age_daty_dt'],
dtype='object')
In [98]:
scientists_dropped = scientists.drop(['Age'], axis=1)
scientists_dropped
Out[98]:
 | Name | Born | Died | Occupation | born_dt | died_dt | age_daty_dt |
---|---|---|---|---|---|---|---|
0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | Chemist | 1920-07-25 | 1958-04-16 | 13779 days |
1 | William Gosset | 1876-06-13 | 1937-10-16 | Statistician | 1876-06-13 | 1937-10-16 | 22404 days |
2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | Nurse | 1820-05-12 | 1910-08-13 | 32964 days |
3 | Marie Curie | 1867-11-07 | 1934-07-04 | Chemist | 1867-11-07 | 1934-07-04 | 24345 days |
4 | Rachel Carson | 1907-05-27 | 1964-04-14 | Biologist | 1907-05-27 | 1964-04-14 | 20777 days |
5 | John Snow | 1813-03-15 | 1858-06-16 | Physician | 1813-03-15 | 1858-06-16 | 16529 days |
6 | Alan Turing | 1912-06-23 | 1954-06-07 | Computer Scientist | 1912-06-23 | 1954-06-07 | 15324 days |
7 | Johann Gauss | 1777-04-30 | 1855-02-23 | Mathematician | 1777-04-30 | 1855-02-23 | 28422 days |
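The same drop can be written with the `columns=` keyword, which avoids having to remember which axis number means columns. A minimal sketch on a one-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Marie Curie"],
    "Age": [66],
    "Occupation": ["Chemist"],
})

# Equivalent to df.drop(["Age"], axis=1).
dropped = df.drop(columns=["Age"])

print(list(dropped.columns))  # ['Name', 'Occupation']

# drop returns a new DataFrame; the original still has the Age column.
print(list(df.columns))       # ['Name', 'Age', 'Occupation']
```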
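Saving Data as Pickle, CSV, and TSV Files and Loading It Back
1. Saving as a pickle
- Stores data in binary form, taking up less space than a spreadsheet
- The name comes from pickling, that is, preserving something for a long time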
In [99]:
names = scientists['Name']
names
Out[99]:
0 Rosaline Franklin
1 William Gosset
2 Florence Nightingale
3 Marie Curie
4 Rachel Carson
5 John Snow
6 Alan Turing
7 Johann Gauss
Name: Name, dtype: object
In [100]:
names.to_pickle('../../output/scientists_names_series.pickle')
In [101]:
scientists.to_pickle('../../output/scientists_df.pickle')
In [102]:
scientists_names_from_pickle = pd.read_pickle('../../output/scientists_names_series.pickle')
print(scientists_names_from_pickle)
0 Rosaline Franklin
1 William Gosset
2 Florence Nightingale
3 Marie Curie
4 Rachel Carson
5 John Snow
6 Alan Turing
7 Johann Gauss
Name: Name, dtype: object
In [103]:
sc = pd.read_pickle('../../output/scientists_df.pickle')
print(sc)
Name Born Died Age Occupation \
0 Rosaline Franklin 1920-07-25 1958-04-16 66 Chemist
1 William Gosset 1876-06-13 1937-10-16 56 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 41 Nurse
3 Marie Curie 1867-11-07 1934-07-04 77 Chemist
4 Rachel Carson 1907-05-27 1964-04-14 90 Biologist
5 John Snow 1813-03-15 1858-06-16 45 Physician
6 Alan Turing 1912-06-23 1954-06-07 37 Computer Scientist
7 Johann Gauss 1777-04-30 1855-02-23 61 Mathematician
born_dt died_dt age_daty_dt
0 1920-07-25 1958-04-16 13779 days
1 1876-06-13 1937-10-16 22404 days
2 1820-05-12 1910-08-13 32964 days
3 1867-11-07 1934-07-04 24345 days
4 1907-05-27 1964-04-14 20777 days
5 1813-03-15 1858-06-16 16529 days
6 1912-06-23 1954-06-07 15324 days
7 1777-04-30 1855-02-23 28422 days
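One practical difference between pickle and CSV is that a pickle keeps column types such as datetime, while a CSV read back with default settings does not. The sketch below writes to a temporary directory instead of the `../../output` folder; the file names `demo.pickle` and `demo.csv` are made up for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"born_dt": pd.to_datetime(["1920-07-25", "1876-06-13"])})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pickle")
    csv_path = os.path.join(tmp, "demo.csv")

    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)  # no parse_dates, so dates stay as text

# The pickle round trip preserves the datetime type; the CSV one does not.
print(pd.api.types.is_datetime64_any_dtype(from_pickle["born_dt"]))  # True
print(pd.api.types.is_datetime64_any_dtype(from_csv["born_dt"]))     # False
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.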
2. Saving as CSV and TSV files
In [104]:
names.to_csv('../../output/names_series.csv')
In [105]:
scientists.to_csv('../../output/scintists_df.tsv', sep='\t')  # TSV uses a tab separator
In [106]:
scientists.to_csv('../../output/scintists_df_no_index.tsv', sep='\t', index=False)  # omit the index column
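To check what a tab-separated export looks like without touching the disk, the data can be written to an in-memory buffer and read back. This is a sketch with a made-up two-row frame; `buffer`, `text`, and `restored` are illustrative names.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Marie Curie", "John Snow"], "Age": [66, 45]})

# Write tab-separated text into an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
text = buffer.getvalue()

# The header row shows the tab between column names.
print(repr(text.splitlines()[0]))  # 'Name\tAge'

# Reading it back requires the same separator.
restored = pd.read_csv(io.StringIO(text), sep="\t")
print(restored.equals(df))  # True
```

If the separator is omitted when reading, pandas assumes commas and each line ends up in a single column.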
In [108]:
names
Out[108]:
0 Rosaline Franklin
1 William Gosset
2 Florence Nightingale
3 Marie Curie
4 Rachel Carson
5 John Snow
6 Alan Turing
7 Johann Gauss
Name: Name, dtype: object
In [112]:
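# Convert the Series to a DataFrame before saving it as an Excel file
# (recent pandas versions also provide Series.to_excel, so this conversion is optional)
names_df = names.to_frame() # convert the Series to a DataFrame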
import xlwt
names_df.to_excel('../../output/scientists_names_series_df.xls')
import openpyxl
names_df.to_excel('../../output/scientists_names_series_df.xlsx')
C:\Users\Playdata\AppData\Local\Temp\ipykernel_8792\578425899.py:6: FutureWarning: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing in the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning. While this option is deprecated and will also raise a warning, it can be globally set and the warning suppressed.
names_df.to_excel('../../output/scientists_names_series_df.xls')