Using Series and Boolean Extraction

In [1]:

import pandas as pd

In [5]:

scientists = pd.read_csv('../../data/scientists.csv')
scientists.head()

Out[5]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist

In [8]:

ages = scientists['Age']
print(ages)
print(ages.max())

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64
90

In [18]:

print(ages.mean())

59.125

In [17]:

# Extract only the people who are older than the mean age
ages[ages > ages.mean()]

Out[17]:

1    61
2    90
3    66
7    77
Name: Age, dtype: int64

In [19]:

# The values at indexes 1, 2, 3, and 7 are True
ages > ages.mean()

Out[19]:

0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool

In [20]:

# Boolean extraction
# Passing a list of True/False values to a Series extracts only the data at the True positions
manual_bool_values = [True, True, False, False, True, True, False, True]
print(ages[manual_bool_values])

0    37
1    61
4    56
5    45
7    77
Name: Age, dtype: int64
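A comparison-based mask can also be inverted, which the cells above do not show. The following is a minimal sketch, assuming a hand-built `ages` Series that stands in for the `Age` column loaded from the CSV; the names `mask`, `older`, and `younger` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for scientists['Age'] (values copied from the output above).
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# A comparison returns a boolean Series that shares the index of ages.
mask = ages > ages.mean()

# Indexing with the mask keeps the True rows; ~mask flips every value,
# so it keeps exactly the rows the mask rejected.
older = ages[mask]
younger = ages[~mask]

print(older.tolist())
print(younger.tolist())
```

Together, `older` and `younger` cover every row once, which is a quick way to check that a filter splits the data as intended.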

In [24]:

# Sort by age in descending order
scientists.sort_values(by='Age', ascending=False).head()

Out[24]:

	Name	Born	Died	Age	Occupation
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist

In [25]:

# Sort by age in ascending order
scientists.sort_values(by='Age', ascending=True).head()

Out[25]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist
5	John Snow	1813-03-15	1858-06-16	45	Physician
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician

Broadcasting with Vectors and Scalars

Vector: data that holds multiple values, such as a Series
Scalar: data that holds a single value representing only a magnitude

In [30]:

# Adding or multiplying vectors of the same length yields a vector of the same length
ages + ages

Out[30]:

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [31]:

ages * ages

Out[31]:

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

In [34]:

# A scalar operation is applied to every value in the vector: this is broadcasting
ages + 100

Out[34]:

0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64

In [35]:

# A shorter Series, used next in an operation between vectors of different lengths
pd.Series([1, 100])

Out[35]:

0      1
1    100
dtype: int64

In [37]:

# When two Series are combined, only matching index labels are computed (here 0 and 1); the rest become NaN
ages + pd.Series([1, 100])

Out[37]:

0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64
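If the `NaN` values are not wanted, `Series.add` accepts a `fill_value` that is used for labels missing on one side. This is a minimal sketch, assuming a hand-built `ages` Series in place of the CSV column; `short`, `plain`, and `filled` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for scientists['Age'].
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
short = pd.Series([1, 100])

# The + operator aligns on index labels; labels 2-7 exist only in ages,
# so those positions become NaN.
plain = ages + short

# With fill_value=0, a label missing from one Series is treated as 0.
filled = ages.add(short, fill_value=0)

print(plain.isna().sum())
print(filled.tolist())
```

The result is still float64 because alignment introduces missing values before they are filled.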

In [40]:

# Sort the index in reverse order
rev_ages = ages.sort_index(ascending=False)
rev_ages

Out[40]:

7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

In [42]:

# Values are added by matching index label between ages (0-7) and rev_ages (0-7), not by position
ages + rev_ages

Out[42]:

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
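The result matches `ages + ages` because pandas pairs values by label, not by row order. To contrast this with position-based addition, here is a small sketch that assumes a hand-built `ages` Series; converting one operand with `to_numpy()` drops its index, so the values are paired by position instead.

```python
import pandas as pd

# Hypothetical stand-in for scientists['Age'].
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
rev_ages = ages.sort_index(ascending=False)

# Label-based: label 0 is added to label 0, label 1 to label 1, and so on.
by_label = ages + rev_ages

# Position-based: a NumPy array has no index, so the first value of ages
# is paired with the first value of the reversed array (the last age).
by_position = ages + rev_ages.to_numpy()

print(by_label.sort_index().tolist())
print(by_position.tolist())
```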

In [46]:

scientists[scientists['Age'] > scientists['Age'].mean()]

Out[46]:

	Name	Born	Died	Age	Occupation
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician
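Boolean conditions on a DataFrame can be combined as well. The sketch below assumes a small hand-built DataFrame with a subset of the rows shown above; the threshold of 50 and the name `older_chemists` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical subset of the scientists data.
scientists = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale", "Marie Curie"],
    "Age": [37, 61, 90, 66],
    "Occupation": ["Chemist", "Statistician", "Nurse", "Chemist"],
})

# Each condition needs its own parentheses; & means "and", | means "or".
is_older = scientists["Age"] > 50
is_chemist = scientists["Occupation"] == "Chemist"
older_chemists = scientists[is_older & is_chemist]

print(older_chemists["Name"].tolist())
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.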

In [50]:

# The rows at indexes 2 and 6 have a False bool value, so they are not returned
scientists.loc[[True,True,False,True,True,True,False,True]]

Out[50]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist
5	John Snow	1813-03-15	1858-06-16	45	Physician
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician

In [52]:

# DataFrame broadcasting -> integer data is multiplied by 2, and string data is repeated twice
scientists * 2

Out[52]:

	Name	Born	Died	Age	Occupation
0	Rosaline FranklinRosaline Franklin	1920-07-251920-07-25	1958-04-161958-04-16	74	ChemistChemist
1	William GossetWilliam Gosset	1876-06-131876-06-13	1937-10-161937-10-16	122	StatisticianStatistician
2	Florence NightingaleFlorence Nightingale	1820-05-121820-05-12	1910-08-131910-08-13	180	NurseNurse
3	Marie CurieMarie Curie	1867-11-071867-11-07	1934-07-041934-07-04	132	ChemistChemist
4	Rachel CarsonRachel Carson	1907-05-271907-05-27	1964-04-141964-04-14	112	BiologistBiologist
5	John SnowJohn Snow	1813-03-151813-03-15	1858-06-161858-06-16	90	PhysicianPhysician
6	Alan TuringAlan Turing	1912-06-231912-06-23	1954-06-071954-06-07	82	Computer ScientistComputer Scientist
7	Johann GaussJohann Gauss	1777-04-301777-04-30	1855-02-231855-02-23	154	MathematicianMathematician
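Repeating the strings is rarely the intended result. One possible way to limit the broadcast to numeric columns is `select_dtypes`, sketched here on a small hand-built DataFrame; `numeric_cols` and `doubled` are illustrative names.

```python
import pandas as pd

# Hypothetical two-row stand-in for the scientists DataFrame.
df = pd.DataFrame({"Name": ["Marie Curie", "John Snow"], "Age": [66, 45]})

# select_dtypes(include="number") returns only the numeric columns.
numeric_cols = df.select_dtypes(include="number").columns

# Work on a copy so the original DataFrame stays unchanged.
doubled = df.copy()
doubled[numeric_cols] = doubled[numeric_cols] * 2

print(doubled)
```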

Processing Data in Series and DataFrames

In [58]:

# The dates are stored as strings
print(scientists['Born'].dtype)
print(scientists['Died'].dtype)
scientists['Born']

object
object

Out[58]:

0    1920-07-25
1    1876-06-13
2    1820-05-12
3    1867-11-07
4    1907-05-27
5    1813-03-15
6    1912-06-23
7    1777-04-30
Name: Born, dtype: object

In [66]:

# Convert to the datetime type so that time-related operations can be performed
born_datatime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
print(born_datatime)
type(born_datatime)

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]

Out[66]:

pandas.core.series.Series

In [67]:

died_datatime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
print(died_datatime)

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]
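The conversions above succeed because every value matches the format. When a column may contain malformed entries, `errors='coerce'` turns them into `NaT` instead of raising. The sketch below uses a made-up `raw` Series with one invalid string to show the effect.

```python
import pandas as pd

# Hypothetical input: the middle value is deliberately not a valid date.
raw = pd.Series(["1920-07-25", "not a date", "1820-05-12"])

# errors="coerce" replaces unparseable values with NaT (the datetime "missing" marker).
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().tolist())
print(parsed[0].year)
```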

In [68]:

scientists.head(n=3)

Out[68]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse

In [80]:

scientists['born_dt'], scientists['died_dt'] = [born_datatime, died_datatime]

In [81]:

scientists.head()

Out[81]:

	Name	Born	Died	Age	Occupation	born_dt	died_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist	1920-07-25	1958-04-16
1	William Gosset	1876-06-13	1937-10-16	61	Statistician	1876-06-13	1937-10-16
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse	1820-05-12	1910-08-13
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist	1867-11-07	1934-07-04
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist	1907-05-27	1964-04-14

In [84]:

type([born_datatime, died_datatime])

Out[84]:

list

In [87]:

scientists.shape

Out[87]:

(8, 7)

In [89]:

scientists['age_daty_dt'] = scientists['died_dt'] - scientists['born_dt']
scientists

Out[89]:

	Name	Born	Died	Age	Occupation	born_dt	died_dt	age_daty_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist	1920-07-25	1958-04-16	13779 days
1	William Gosset	1876-06-13	1937-10-16	61	Statistician	1876-06-13	1937-10-16	22404 days
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse	1820-05-12	1910-08-13	32964 days
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist	1867-11-07	1934-07-04	24345 days
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist	1907-05-27	1964-04-14	20777 days
5	John Snow	1813-03-15	1858-06-16	45	Physician	1813-03-15	1858-06-16	16529 days
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist	1912-06-23	1954-06-07	15324 days
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician	1777-04-30	1855-02-23	28422 days
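The difference of two datetime columns is a timedelta column, and its `.dt.days` accessor exposes the day counts as plain integers. As a rough sketch, the first two rows can be converted to approximate years; dividing by 365.25 is an assumption made here for illustration, not an exact calendar calculation.

```python
import pandas as pd

# Hypothetical stand-ins for the born_dt and died_dt columns (first two rows).
born = pd.to_datetime(pd.Series(["1920-07-25", "1876-06-13"]))
died = pd.to_datetime(pd.Series(["1958-04-16", "1937-10-16"]))

lifespan = died - born                 # timedelta Series
days = lifespan.dt.days                # whole days as integers
years = (days / 365.25).astype(int)    # approximate years, truncated

print(days.tolist())
print(years.tolist())
```

The truncated values agree with the `Age` column in the original data for these two rows.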

Shuffling Data in Series and DataFrames

In [90]:

scientists['Age']

Out[90]:

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [91]:

# random.seed() sets the starting value for the random numbers the computer generates, so the result is reproducible

import random

random.seed(42)
random.shuffle(scientists['Age'])
scientists['Age']

C:\ProgramData\anaconda3\lib\random.py:394: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x[i], x[j] = x[j], x[i]

Out[91]:

0    66
1    56
2    41
3    77
4    90
5    45
6    37
7    61
Name: Age, dtype: int64
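The `SettingWithCopyWarning` above comes from `random.shuffle` swapping elements in place on a column taken from the DataFrame. One alternative that avoids in-place mutation is `sample(frac=1)`, sketched below on a hand-built Series; it is a different technique from the one in the cell, and the resulting order will not match the output shown there.

```python
import pandas as pd

# Hypothetical stand-in for scientists['Age'].
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# sample(frac=1) returns all rows in random order as a new object;
# random_state plays the role of the seed.
# reset_index(drop=True) renumbers the rows 0..n-1 after shuffling.
shuffled = ages.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.tolist())
print(ages.tolist())  # the original order is untouched
```

Assigning the result back, for example with `scientists['Age'] = shuffled`, would then replace the column in one step.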

Dropping Columns from a DataFrame

In [92]:

scientists.columns

Out[92]:

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
       'age_daty_dt'],
      dtype='object')

In [98]:

scientists_dropped = scientists.drop(['Age'], axis=1)
scientists_dropped

Out[98]:

	Name	Born	Died	Occupation	born_dt	died_dt	age_daty_dt
0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	1920-07-25	1958-04-16	13779 days
1	William Gosset	1876-06-13	1937-10-16	Statistician	1876-06-13	1937-10-16	22404 days
2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	1820-05-12	1910-08-13	32964 days
3	Marie Curie	1867-11-07	1934-07-04	Chemist	1867-11-07	1934-07-04	24345 days
4	Rachel Carson	1907-05-27	1964-04-14	Biologist	1907-05-27	1964-04-14	20777 days
5	John Snow	1813-03-15	1858-06-16	Physician	1813-03-15	1858-06-16	16529 days
6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	1912-06-23	1954-06-07	15324 days
7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	1777-04-30	1855-02-23	28422 days
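`drop` returns a new DataFrame and leaves the original untouched, which is why the result is stored in a separate variable above. The sketch below shows the same operation with the `columns=` keyword, an equivalent spelling of `axis=1`, on a one-row hand-built DataFrame.

```python
import pandas as pd

# Hypothetical one-row stand-in for the scientists DataFrame.
df = pd.DataFrame({"Name": ["Marie Curie"], "Age": [66], "Occupation": ["Chemist"]})

# columns=[...] is equivalent to passing the labels with axis=1.
dropped = df.drop(columns=["Age"])

print(list(dropped.columns))
print(list(df.columns))  # the original still has the 'Age' column
```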

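Saving Data to Pickle, CSV, and TSV Files and Loading It Back

1. Saving as a Pickle

A pickle stores data in binary form and takes up less space than a spreadsheet.
The name comes from the idea of preserving something for a long time.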

In [99]:

names = scientists['Name']
names

Out[99]:

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object

In [100]:

names.to_pickle('../../output/scientists_names_series.pickle')

In [101]:

scientists.to_pickle('../../output/scientists_df.pickle')

In [102]:

scientists_names_from_pickle = pd.read_pickle('../../output/scientists_names_series.pickle')
print(scientists_names_from_pickle)

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object

In [103]:

sc = pd.read_pickle('../../output/scientists_df.pickle')
print(sc)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   66             Chemist   
1        William Gosset  1876-06-13  1937-10-16   56        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   41               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   77             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   90           Biologist   
5             John Snow  1813-03-15  1858-06-16   45           Physician   
6           Alan Turing  1912-06-23  1954-06-07   37  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   61       Mathematician   

     born_dt    died_dt age_daty_dt  
0 1920-07-25 1958-04-16  13779 days  
1 1876-06-13 1937-10-16  22404 days  
2 1820-05-12 1910-08-13  32964 days  
3 1867-11-07 1934-07-04  24345 days  
4 1907-05-27 1964-04-14  20777 days  
5 1813-03-15 1858-06-16  16529 days  
6 1912-06-23 1954-06-07  15324 days  
7 1777-04-30 1855-02-23  28422 days
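As the printed output suggests, a pickle keeps column types such as datetime intact. The round trip can be checked with a short sketch that writes to a temporary directory; the path and the one-row DataFrame are assumptions made for illustration, not the files used above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row DataFrame with a datetime column.
df = pd.DataFrame({
    "Name": ["Marie Curie"],
    "born_dt": pd.to_datetime(["1867-11-07"]),
})

# Write to a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scientists_df.pickle")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The datetime dtype survives the round trip without re-parsing.
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.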

2. Saving as CSV and TSV Files

In [104]:

names.to_csv('../../output/names_series.csv')

In [105]:

# A TSV file separates columns with a tab character, so sep must be '\t'
scientists.to_csv('../../output/scintists_df.tsv', sep='\t')

In [106]:

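# index=False leaves out the row index; sep='\t' keeps the file tab-separated to match its .tsv extension
scientists.to_csv('../../output/scintists_df_no_index.tsv', sep='\t', index=False)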
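Unlike a pickle, a text file does not store column types, so dates come back as strings unless they are parsed again. The following sketch writes a TSV into an in-memory buffer instead of a file and reads it back; the two-row DataFrame and the variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the scientists DataFrame.
df = pd.DataFrame({
    "Name": ["Marie Curie", "John Snow"],
    "Born": ["1867-11-07", "1813-03-15"],
    "Age": [66, 45],
})

# Write tab-separated text without the row index.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
text = buffer.getvalue()

# Read it back with the same separator; parse_dates restores the datetime type.
restored = pd.read_csv(io.StringIO(text), sep="\t", parse_dates=["Born"])

print(text.splitlines()[0])
print(restored.dtypes)
```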

In [108]:

names

Out[108]:

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object

In [112]:

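# Convert the Series to a DataFrame so that it is written with a tabular structure
# (pandas also provides Series.to_excel, so this conversion is optional)
names_df = names.to_frame()   # convert the Series to a DataFrame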

import xlwt
names_df.to_excel('../../output/scientists_names_series_df.xls')

import openpyxl
names_df.to_excel('../../output/scientists_names_series_df.xlsx')

C:\Users\Playdata\AppData\Local\Temp\ipykernel_8792\578425899.py:6: FutureWarning: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing in the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning. While this option is deprecated and will also raise a warning, it can be globally set and the warning suppressed.
  names_df.to_excel('../../output/scientists_names_series_df.xls')


Woogi

[23.06.13] Python Series, DataFrame - 08(3)
