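Data Analysis

[23.06.14] Python Seaborn - 09(1)

gmwoo 2023. 6. 14. 14:08

Snippet to keep Korean text and the minus sign from rendering as broken glyphs in plots. It is used very often, so memorize it or keep it in your notes!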

import matplotlib.pyplot as plt
from matplotlib import font_manager, rc
plt.rcParams['axes.unicode_minus'] = False   # keep the minus sign from rendering as a broken glyph
# f_path = "/Library/Fonts/AppleGothic.ttf"   # macOS
f_path = "C:/Windows/Fonts/malgun.ttf"        # Windows (Malgun Gothic)
font_name = font_manager.FontProperties(fname=f_path).get_name()
rc('font', family=font_name)

1. Seaborn

Graphs

Loading the Anscombe dataset and plotting it

1. Load the Anscombe dataset

In [1]:

import seaborn as sns
anscombe = sns.load_dataset("anscombe")
print(anscombe)
print(type(anscombe))

   dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68
11      II  10.0   9.14
12      II   8.0   8.14
13      II  13.0   8.74
14      II   9.0   8.77
15      II  11.0   9.26
16      II  14.0   8.10
17      II   6.0   6.13
18      II   4.0   3.10
19      II  12.0   9.13
20      II   7.0   7.26
21      II   5.0   4.74
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0   5.25
40      IV  19.0  12.50
41      IV   8.0   5.56
42      IV   8.0   7.91
43      IV   8.0   6.89
<class 'pandas.core.frame.DataFrame'>
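
This dataset is a common plotting exercise because its groups have nearly identical summary statistics while looking very different on a chart. A minimal sketch of that check, using only datasets I and II as printed above (the names `sample` and `summary` are illustrative, not part of the original notebook):

```python
import pandas as pd

# Datasets I and II, copied from the printed output above.
x_values = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y_one = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_two = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

sample = pd.DataFrame({
    "dataset": ["I"] * 11 + ["II"] * 11,
    "x": x_values * 2,
    "y": y_one + y_two,
})

# Mean and standard deviation of x and y for each dataset.
summary = sample.groupby("dataset")[["x", "y"]].agg(["mean", "std"])
print(summary.round(2))
```

With the full data, the same `groupby` call can be run on `anscombe` directly.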

In [2]:

import matplotlib.pyplot as plt

In [5]:

dataset_1 = anscombe[anscombe['dataset'] == 'I']

In [6]:

print(dataset_1)

   dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68

In [8]:

plt.plot(dataset_1['x'], dataset_1['y'])

Out[8]:

[<matplotlib.lines.Line2D at 0x1eec4bcae30>]

In [11]:

plt.plot(dataset_1['x'], dataset_1['y'], 'o', c='r')

Out[11]:

[<matplotlib.lines.Line2D at 0x1eec73c0070>]

In [14]:

plt.plot([1, 4, 9, 16], c='b',
        lw=5, ls="--", marker='o',
        ms=15, mec='g', mew=5, mfc='r')

plt.xlim(-0.2, 3.2)
plt.ylim(-1, 18)
plt.show()
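
The short keyword names in the cell above are aliases for longer ones. As a reading aid, here is the same line written with the long names on an explicit axes object (a sketch; `fig`, `ax`, and `line` are names chosen for this example):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot(
    [1, 4, 9, 16],
    color="b",              # c
    linewidth=5,            # lw
    linestyle="--",         # ls
    marker="o",
    markersize=15,          # ms
    markeredgecolor="g",    # mec
    markeredgewidth=5,      # mew
    markerfacecolor="r",    # mfc
)
ax.set_xlim(-0.2, 3.2)
ax.set_ylim(-1, 18)
```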

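Plotting with the matplotlib library

1. Create the figure, the base frame that holds all the plots

2. Define the grid of subplots to draw into

3. Add plots to the grid one at a time, from left to right

4. When the first row is full, continue on the second row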
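
The four steps above can also be sketched with `plt.subplots`, which creates the figure and the whole grid in one call. This is an alternative to the `add_subplot` calls used in this post, shown only to make the fill order visible; the "position" titles are placeholders:

```python
import matplotlib.pyplot as plt

# Steps 1 and 2: one figure holding a 2 x 2 grid of axes.
fig, axes = plt.subplots(nrows=2, ncols=2)

# Steps 3 and 4: axes.flat walks the grid left to right, then moves to the next row.
for index, ax in enumerate(axes.flat, start=1):
    ax.set_title(f"position {index}")

fig.tight_layout()
```

Position 2 is the top-right cell and position 3 is the bottom-left cell, which matches the third argument of `fig.add_subplot(2, 2, n)`.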

In [15]:

dataset_2 = anscombe[anscombe['dataset']=="II"]
dataset_3 = anscombe[anscombe['dataset']=="III"]
dataset_4 = anscombe[anscombe['dataset']=="IV"]

In [16]:

fig = plt.figure()
axes1 = fig.add_subplot(2,2,1)  # (number of rows, number of columns, position index)
axes2 = fig.add_subplot(2,2,2)
axes3 = fig.add_subplot(2,2,3)
axes4 = fig.add_subplot(2,2,4)

In [35]:

axes1.plot(dataset_1['x'], dataset_1['y'], 'o', c='r')
axes2.plot(dataset_2['x'], dataset_2['y'], 'o', c='b')
axes3.plot(dataset_3['x'], dataset_3['y'], 'o', c='y')
axes4.plot(dataset_4['x'], dataset_4['y'], 'o', c='g')

fig

Out[35]:

In [36]:

# Add a title to each subplot
axes1.set_title("dataset_1")
axes2.set_title("dataset_2")
axes3.set_title("dataset_3")
axes4.set_title("dataset_4")

fig

Out[36]:

In [37]:

fig.suptitle("Anscombe Data")
fig

Out[37]:

In [38]:

fig.tight_layout()
fig

Out[38]:

Basic plots: histogram, scatter plot, box plot

In [40]:

tips = sns.load_dataset("tips")
print(tips.head())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

In [53]:

# fig (base frame) and one subplot in the grid (axes1)
fig = plt.figure()
axes1 = fig.add_subplot(1,1,1)

1. A histogram is often used to examine the distribution and frequency of a DataFrame column

A 'univariate plot', drawn from a single variable

In [54]:

axes1.hist(tips['total_bill'], bins=10) # bins=10: split the x-axis range into 10 equal-width bins
axes1.set_title('Histogram of Total Bill')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Frequency')
fig

Out[54]:
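
To see what `bins=10` does, the counting behind a histogram can be reproduced with `numpy.histogram`. This is a sketch on made-up values, not the tips data:

```python
import numpy as np

bills = np.array([5, 10, 12, 15, 18, 20, 22, 25, 30, 55])  # made-up values

# 10 bins means 11 edges, evenly spaced between the minimum and the maximum.
counts, edges = np.histogram(bills, bins=10)
bin_width = edges[1] - edges[0]

print(counts)     # how many values fall into each bin
print(edges)      # the bin boundaries
print(bin_width)  # (max - min) / 10
```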

2. Scatter plot

A 'bivariate plot' using two variables: a scatter plot of the tip column against the total_bill column

In [55]:

scatter_plot = plt.figure()
axes1 = scatter_plot.add_subplot(1,1,1)
axes1.scatter(tips['total_bill'], tips['tip'])
axes1.set_title('Scatterplot of Total Bill vs Tip')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Tip')

Out[55]:

Text(0, 0.5, 'Tip')

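3. Box plot

Discrete variable: values that fall into clearly separate categories, such as Female and Male

Continuous variable: values over a range that cannot be counted off exactly, such as Tip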

In [56]:

boxplot = plt.figure()
axes1 = boxplot.add_subplot(1,1,1)

axes1.boxplot(
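    [tips[tips['sex'] == 'Female']['tip'], # tip column of the rows in tips where sex is 'Female'
    tips[tips['sex'] == 'Male']['tip']],   # tip column of the rows where sex is 'Male'
    labels = ['Female', 'Male']            # Matplotlib 3.9+ prefers the name tick_labels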
)

axes1.set_xlabel('Sex')
axes1.set_ylabel('Tip')
axes1.set_title('Boxplot of Tips by Sex')

Out[56]:

Text(0.5, 1.0, 'Boxplot of Tips by Sex')

Multivariate plots from multivariate data: scatter plot

Multivariate: uses three or more variables

In [57]:

# Add sex as a new numeric variable
# (the strings 'Female'/'Male' cannot be used directly as scatter plot color values)
def recode_sex(sex):
    if sex == 'Female':
        return 0
    else:
        return 1

In [58]:

tips['sex_color'] = tips['sex'].apply(recode_sex)
tips

Out[58]:

	total_bill	tip	sex	smoker	day	time	size	sex_color
0	16.99	1.01	Female	No	Sun	Dinner	2	0
1	10.34	1.66	Male	No	Sun	Dinner	3	1
2	21.01	3.50	Male	No	Sun	Dinner	3	1
3	23.68	3.31	Male	No	Sun	Dinner	2	1
4	24.59	3.61	Female	No	Sun	Dinner	4	0
...	...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3	1
240	27.18	2.00	Female	Yes	Sat	Dinner	2	0
241	22.67	2.00	Male	Yes	Sat	Dinner	2	1
242	17.82	1.75	Male	No	Sat	Dinner	2	1
243	18.78	3.00	Female	No	Thur	Dinner	2	0

244 rows × 8 columns
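
The same 0/1 recoding can be written without a helper function by comparing the column to `'Male'` and casting the boolean result to an integer. A small sketch on toy data (`people` is a made-up frame, not the tips dataset):

```python
import pandas as pd

people = pd.DataFrame({"sex": ["Female", "Male", "Male", "Female"]})  # toy data

# True/False becomes 1/0, so 'Male' maps to 1 and everything else to 0.
people["sex_color"] = (people["sex"] == "Male").astype(int)
print(people)
```

Like `recode_sex`, this treats every value other than `'Male'` as 0.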

In [67]:

scatter_plot = plt.figure()
axes1 = scatter_plot.add_subplot(1,1,1)
axes1.scatter(
    x=tips['total_bill'],
    y=tips['tip'],
    s=tips['size'] * 10,
    c=tips['sex_color'],
    alpha=0.5
)


axes1.set_title('Total Bill vs Tip Colored by Sex and Sized by Size')
axes1.set_xlabel('Total Bill')
axes1.set_ylabel('Tip')

Out[67]:

Text(0, 0.5, 'Tip')

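Histograms with the seaborn library

To draw a histogram with seaborn, use plt.subplots and sns.distplot.
plt.subplots creates the base frame (figure and axes); distplot receives the total_bill column data.
Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot; the cell below silences warnings, including that one.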

In [71]:

import warnings
warnings.filterwarnings(action='ignore')

In [72]:

fig, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], ax=ax)
ax.set_title('Total Bill Histogram with Density Plot')

Out[72]:

Text(0.5, 1.0, 'Total Bill Histogram with Density Plot')

In [73]:

# Exclude the density curve (a curve normalized so that the area under it is 1)
# KDE = Kernel Density Estimation
fig, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], kde=False, ax=ax)
ax.set_title('Total Bill Histogram')

Out[73]:

Text(0.5, 1.0, 'Total Bill Histogram')

In [75]:

# Exclude the histogram and draw only the density curve
fig, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], hist=False, ax=ax)
ax.set_title('Total Bill Density Plot')

Out[75]:

Text(0.5, 1.0, 'Total Bill Density Plot')

In [78]:

# Add the rug argument: short, equal-length ticks along the axis mark each observation and show how densely the data cluster
fig, ax = plt.subplots()
ax = sns.distplot(tips['total_bill'], rug=True, ax=ax)
ax.set_title('Total Bill Histogram with Density and Rug Plot')
ax.set_xlabel('Total Bill')

Out[78]:

Text(0.5, 0, 'Total Bill')

Count plot: a plot of discrete values

Mainly used to examine the distribution of a categorical variable
Usage: sns.countplot(x=... or y=..., data=...)
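
The bar heights of a count plot are simply the number of rows per category, which pandas can report directly with `value_counts`. A sketch on toy data (`days` is made up for this example):

```python
import pandas as pd

days = pd.Series(["Sat", "Sun", "Sat", "Thur", "Sat", "Fri", "Sun"], name="day")  # toy data

# One row per category with its number of occurrences, largest first.
day_counts = days.value_counts()
print(day_counts)
```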

In [97]:

fig1, ax1 = plt.subplots()
ax1 = sns.countplot(x='day', data=tips, ax=ax1)
ax1.set_title('Count of days')
ax1.set_xlabel('Day of the Week')
ax1.set_ylabel('Frequency')

# Horizontal version: the categories move to the y-axis, so the axis labels swap
fig2, ax2 = plt.subplots()
ax2 = sns.countplot(y='day', data=tips, ax=ax2)
ax2.set_title('Count of days')
ax2.set_xlabel('Frequency')
ax2.set_ylabel('Day of the Week')

Out[97]:

Text(0, 0.5, 'Day of the Week')

Drawing various kinds of bivariate plots

1. Scatter plot with the seaborn library

The regplot method draws a scatter plot together with a regression line
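
By default the line that regplot adds is a straight-line (linear regression) fit. To show what such a line is, the sketch below fits one with `numpy.polyfit` on the five rows printed by `tips.head()` earlier; five points are far too few for any conclusion and serve only as an illustration:

```python
import numpy as np

total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])  # first five rows of tips
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# Least-squares straight line: tip is approximated by slope * total_bill + intercept.
slope, intercept = np.polyfit(total_bill, tip, deg=1)
fitted_tip = slope * total_bill + intercept

print(slope, intercept)
```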

In [89]:

fig, ax = plt.subplots()
ax = sns.regplot(x='total_bill', y='tip', data=tips, ax=ax)
ax.set_title('Scatterplot of Total Bill and Tip')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')

Out[89]:

Text(0, 0.5, 'Tip')

In [99]:

# fit_reg=False: draw only the scatter points, without the regression line
fig, ax = plt.subplots()
ax = sns.regplot(x='total_bill', y='tip', data=tips, fit_reg=False, ax=ax)
ax.set_title('Scatterplot of Total Bill and Tip')
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')

Out[99]:

Text(0, 0.5, 'Tip')

Bar plot

In [102]:

# barplot computes the mean of the given variable (its default estimator) and draws it
# Here: the average total bill for each time of day
fig, ax = plt.subplots()
ax = sns.barplot(x='time', y='total_bill', data=tips, ax=ax)
ax.set_title('Bar plot of average total bill for time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Average total bill')

Out[102]:

Text(0, 0.5, 'Average total bill')
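
The bar heights above are group means, so they can be cross-checked with a pandas `groupby`. A sketch on toy numbers (`bills` is a made-up frame, not the tips data):

```python
import pandas as pd

bills = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 20.0, 30.0, 40.0],
})  # toy data

# Average total_bill within each time-of-day group.
mean_bill = bills.groupby("time")["total_bill"].mean()
print(mean_bill)
```

Running the same `groupby` on `tips` should give the heights of the two bars in the plot.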

Box plot

A box plot shows the minimum, first quartile, median, third quartile, maximum, and outliers of a variable.
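
The numbers behind a box plot can be computed by hand. matplotlib and seaborn by default extend the whiskers up to 1.5 times the interquartile range (IQR) beyond the box and mark points outside that range as outliers. A sketch of that rule on toy values:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # toy data with one extreme value

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                     # height of the box
lower_fence = q1 - 1.5 * iqr      # whiskers stay inside these fences
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, outliers)
```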

In [103]:

fig, ax = plt.subplots()
ax = sns.boxplot(x='time', y='total_bill', data=tips, ax=ax)
ax.set_title('Boxplot of total bill by time of day')
ax.set_xlabel('Time of day')
ax.set_ylabel('Total bill')

Out[103]:

Text(0, 0.5, 'Total bill')

Violin plot with the seaborn library, with color (hue) added

In [106]:

fig, ax = plt.subplots()
ax = sns.violinplot(x='time', y='total_bill', hue='sex',
                    data=tips, split=True, ax=ax)
ax.set_title('Violin plot of total bill by time of day and sex')
ax.set_xlabel('Time of day')
ax.set_ylabel('Total bill')

Out[106]:

Text(0, 0.5, 'Total bill')

In [107]:

labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0 , 0)  # only "explode" the 2nd slice (i.e. 'Hogs')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
       shadow=True, startangle=90)
ax1.axis('equal')
plt.show()

Snippet to keep plot text from breaking

Used very often, so memorize it or keep it in your notes!!

In [110]:

from matplotlib import font_manager, rc
plt.rcParams['axes.unicode_minus'] = False   # keep the minus sign from rendering as a broken glyph
# f_path = "/Library/Fonts/AppleGothic.ttf"   # macOS
f_path = "C:/Windows/Fonts/malgun.ttf"        # Windows (Malgun Gothic)
font_name = font_manager.FontProperties(fname=f_path).get_name()
rc('font', family=font_name)
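
The snippet needs a different path on each operating system, so one option is to wrap it in a small helper that picks the path by platform and skips the font change when the file is missing. This is a sketch: `FONT_PATHS` and `set_korean_font` are hypothetical names, and the paths are the two mentioned above, which may differ on your machine.

```python
import os
import platform

import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

# Candidate font files per OS; these paths are assumptions and may differ on your machine.
FONT_PATHS = {
    "Windows": "C:/Windows/Fonts/malgun.ttf",
    "Darwin": "/Library/Fonts/AppleGothic.ttf",
}


def set_korean_font(system_name=None):
    """Apply the font for this OS if the file exists; return the font name or None."""
    system_name = system_name or platform.system()
    f_path = FONT_PATHS.get(system_name)
    plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign readable
    if f_path is None or not os.path.exists(f_path):
        return None  # leave matplotlib's default font untouched
    font_name = font_manager.FontProperties(fname=f_path).get_name()
    rc('font', family=font_name)
    return font_name


set_korean_font()
```

A return value of `None` means no font was applied, so Korean labels may still appear as empty boxes.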

Exercises

1. Load the titanic dataset and draw a histogram of all passengers by age

In [124]:

import seaborn as sns
import matplotlib.pyplot as plt

In [151]:

titanic = sns.load_dataset("titanic")
print(titanic.columns)

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [178]:

fig = plt.figure()
plt.hist(titanic['age'].dropna(), bins=10)   # drop missing ages before binning
plt.xlabel('age')
plt.ylabel('count')
plt.show()

2. Show the number of male and female passengers (countplot)

In [162]:

fig, ax = plt.subplots()
ax = sns.countplot(x='sex', data=titanic, ax=ax)   # count titanic passengers, not the tips data
ax.set_title('Sex')

Out[162]:

Text(0.5, 1.0, 'Sex')

3. Show the shares of passengers who died and survived (pie)

In [181]:

labels = ['Died', 'Survived']   # value_counts() lists 0 (died) first because it is the larger group
fig, ax = plt.subplots()
titanic['survived'].value_counts().plot.pie(autopct='%1.1f%%', shadow=True,
                                            explode=(0, 0.1), labels=labels,
                                            startangle=90, ax=ax)
plt.show()
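
The pie labels in this exercise depend on the order in which `value_counts()` returns the groups. One way to avoid relying on that order is to map the 0/1 codes to names before counting, so each slice carries its own label. A sketch on toy data (`survived` here is a made-up series standing in for `titanic['survived']`):

```python
import pandas as pd

survived = pd.Series([0, 1, 1, 0, 0, 1, 0], name="survived")  # toy stand-in

# Replace the codes with readable names first, then count each name.
outcome_counts = survived.map({0: "Died", 1: "Survived"}).value_counts()
print(outcome_counts)
```

`outcome_counts.plot.pie(autopct='%1.1f%%')` would then label the slices from the index, with no separate `labels` list.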
