1. HTML structure
* <!DOCTYPE html> : declares the document as HTML5
* <html></html> : the start and end of the HTML document
* <head></head> : holds the CSS, JavaScript, meta, and title information
* <body></body> : the part that is actually rendered on the page
2. Terms
html head body div p a b br
3. What the BeautifulSoup module is
* A third-party Python library that makes it easy to extract data from a web page
* It runs the many HTML tags in a web document through a parser and exposes them as easy-to-use Python objects
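As a minimal sketch of the idea above, the snippet below parses a short HTML string into a soup object and reads two values from it. The inline document `html_text` is a made-up stand-in for illustration; the notebook cells that follow read a file instead.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only so the sketch runs on its own.
html_text = """<!DOCTYPE html>
<html>
<head><title>Very Simple HTML Code</title></head>
<body><p id="first">Naver Home</p></body>
</html>"""

# 'html.parser' is the parser that ships with Python's standard library.
soup = BeautifulSoup(html_text, "html.parser")

title_text = soup.title.string    # text inside the <title> tag
first_p_text = soup.p.get_text()  # text of the first <p> tag
print(title_text)
print(first_p_text)
```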
In [1]:
from bs4 import BeautifulSoup
In [2]:
page = open("../../data/test1.html", "r").read()
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
# .prettify() is handy when you want to see the whole HTML page, indented
<!DOCTYPE html>
<html>
<head>
<title>
Very Simple HTML Code
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">
Naver
</a>
</p>
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">
Python
</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>
<p class="outer-text">
<b>
Have a nice day.
<br/>
Have a niece day2.
</b>
</p>
</body>
</html>
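The cell above opens the file without closing it or naming an encoding. A hedged variant is sketched below: it uses a `with` block and an explicit `encoding="utf-8"` (an assumption about the file), and passes the file handle straight to BeautifulSoup. The temporary file and its contents are stand-ins so the sketch runs anywhere.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Stand-in content; replace with the real path to test1.html when using this.
sample = (
    "<html><head><title>Very Simple HTML Code</title></head>"
    "<body><p>Naver Home</p></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # The with-block closes the file as soon as parsing is done.
    with open(path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

pretty = soup.prettify()
print(pretty)
```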
In [3]:
list(soup.children)
Out[3]:
['html',
'\n',
<html>
<head>
<title>Very Simple HTML Code</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>
</body>
</html>,
'\n']
In [4]:
html = list(soup.children)[2] # index 2 of soup's children is the html tag
html
Out[4]:
<html>
<head>
<title>Very Simple HTML Code</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>
</body>
</html>
In [5]:
## body below and soup.body give the same result
body = list(html.children)[3] # index 3 (the fourth child) of the html tag's children is body
soup.body # the tag can also be reached directly by name, without walking children
Out[5]:
<body>
<div>
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>
</body>
In [6]:
list(body.children)
Out[6]:
['\n',
<div>
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>,
'\n',
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>,
'\n',
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>,
'\n']
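The `'\n'` entries in the list above are whitespace strings that sit between the tags, which is why index arithmetic on `children` is easy to get wrong. One possible way around it, sketched here on a small made-up document, is to keep only the children that are `Tag` objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened version of the body used in this post.
html_text = """<body>
<div><p>Naver Home</p></div>
<p class="outer-text">Data Science is funny.</p>
</body>"""
soup = BeautifulSoup(html_text, "html.parser")
body = soup.body

all_children = list(body.children)  # tags plus the '\n' strings between them
tag_children = [child for child in body.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
print(len(all_children), tag_names)
```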
In [7]:
# if you know which tag you need, use find or find_all
soup.find_all('p')
Out[7]:
[<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>,
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>,
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>,
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>]
In [8]:
# find returns only the first p tag
soup.find('p')
Out[8]:
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
In [9]:
# find the p tags whose class is outer-text
soup.find_all('p', class_='outer-text')
Out[9]:
[<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>,
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>]
In [10]:
# search by class name alone (any tag with class outer-text)
soup.find_all(class_='outer-text')
Out[10]:
[<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>,
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>]
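Note that `class_='outer-text'` also matched the tag whose class attribute is `outer-text first-item`: a tag matches when the given name is one of its classes. The sketch below shows this on a shortened, made-up copy of the document, together with the `attrs` dictionary form for combining conditions.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the p tags used in this post.
html_text = """
<p class="inner-text first-item" id="first">Naver Home</p>
<p class="outer-text first-item" id="second">Data Science is funny.</p>
<p class="outer-text">Have a nice day.</p>
"""
soup = BeautifulSoup(html_text, "html.parser")

# A tag matches if the name is any one of its classes.
outer = soup.find_all("p", class_="outer-text")
first_items = soup.find_all(class_="first-item")

# attrs combines several attribute conditions in one call.
by_attrs = soup.find_all("p", attrs={"class": "outer-text", "id": "second"})

outer_ids = [p.get("id") for p in outer]         # .get returns None if id is missing
first_item_ids = [p["id"] for p in first_items]
print(outer_ids, first_item_ids, len(by_attrs))
```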
In [11]:
# search by id
soup.find_all(id="first")
Out[11]:
[<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>]
In [12]:
soup.head
Out[12]:
<head>
<title>Very Simple HTML Code</title>
</head>
In [13]:
# the node right after soup's head is a newline string
soup.head.next_sibling
Out[13]:
'\n'
In [14]:
soup.head.previous_sibling
Out[14]:
'\n'
In [15]:
soup.head.next_sibling.next_sibling
Out[15]:
<body>
<div>
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
Data Science is funny.
</b>
</p>
<p class="outer-text">
<b>
Have a nice day.<br/>
Have a niece day2.
</b>
</p>
</body>
In [16]:
# the first p tag that appears in body
body.p
Out[16]:
<p class="inner-text first-item" id="first">
Naver Home
<a href="https://www.naver.com/">Naver</a>
</p>
In [17]:
body.p.next_sibling.next_sibling
Out[17]:
<p class="inner-text second-item">
Happy Data Science.
<a href="https://www.python.org" id="py-link">Python</a>
</p>
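Calling `next_sibling` twice works here only because exactly one whitespace string sits between the two tags. As a hedged alternative, `find_next_sibling` skips the strings and returns the next matching tag directly. The document below is a small made-up stand-in.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the div used in this post.
html_text = """<div>
<p id="first">Naver Home</p>
<p id="second">Happy Data Science.</p>
</div>"""
soup = BeautifulSoup(html_text, "html.parser")
first_p = soup.p

raw_next = first_p.next_sibling          # the '\n' string between the two tags
next_p = first_p.find_next_sibling("p")  # skips strings, returns the next <p>
prev_p = first_p.find_previous_sibling("p")  # None: no earlier <p> sibling
print(repr(raw_next), next_p["id"], prev_p)
```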
In [18]:
for each_tag in soup.find_all('p'):
    print(each_tag.get_text())
Naver Home
Naver
Happy Data Science.
Python
Data Science is funny.
Have a nice day.
Have a niece day2.
Problem
Have a nice day. \ Have a nice day2.
In [19]:
for each_tag in soup.find_all('p')[-1]:
    print(each_tag.get_text())
Have a nice day.
Have a niece day2.
In [20]:
# returns all the text with the tags stripped; the newlines (\n) between tags remain
body.get_text()
Out[20]:
'\n\n\n Naver Home\n Naver\n\n\n Happy Data Science.\n Python\n\n\n\n\n Data Science is funny.\n \n\n\n\n Have a nice day.\n\t\t Have a niece day2.\n \n\n'
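The raw string above is hard to read because of the leftover whitespace. A possible cleanup, sketched on a shortened made-up document, is to pass `separator` and `strip` to `get_text`, or to iterate over `stripped_strings`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the body used in this post.
html_text = """<body>
<p>
    Naver Home
    <a href="https://www.naver.com/">Naver</a>
</p>
<p><b>
    Have a nice day.<br/>
    Have a niece day2.
</b></p>
</body>"""
soup = BeautifulSoup(html_text, "html.parser")

# strip=True trims each text piece and drops the empty ones;
# separator is placed between the remaining pieces.
clean = soup.body.get_text(separator=" ", strip=True)

# stripped_strings yields the same pieces one at a time.
pieces = list(soup.body.stripped_strings)
print(clean)
print(pieces)
```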
In [21]:
# find the a tags, which are the clickable links
links = soup.find_all('a')
links
Out[21]:
[<a href="https://www.naver.com/">Naver</a>,
<a href="https://www.python.org" id="py-link">Python</a>]
In [22]:
for each in links:
    href = each['href']
    text = each.string
    print(text + ' -> ' + href)
Naver -> https://www.naver.com/
Python -> https://www.python.org
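Indexing with `each['href']` raises `KeyError` when an `a` tag has no `href`. The sketch below uses `.get('href')` instead and skips such tags. The third anchor in `html_text` is a made-up case added to show the difference.

```python
from bs4 import BeautifulSoup

# The first two links come from this post; the third is a hypothetical
# anchor without href, added only to demonstrate the guard.
html_text = """
<a href="https://www.naver.com/">Naver</a>
<a href="https://www.python.org" id="py-link">Python</a>
<a id="no-href">Anchor without href</a>
"""
soup = BeautifulSoup(html_text, "html.parser")

pairs = []
for link in soup.find_all("a"):
    href = link.get("href")  # None instead of KeyError when href is missing
    if href is None:
        continue
    pairs.append((link.get_text(strip=True), href))

for text, href in pairs:
    print(text + " -> " + href)
```

Passing `href=True` to `find_all` is another way to keep only the tags that have the attribute.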
Finding the tag you want with Chrome DevTools
In [23]:
from urllib.request import urlopen
In [24]:
url = "https://finance.naver.com/marketindex/"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
# print(soup.prettify())
In [25]:
# soup.find_all()
In [26]:
soup.find_all('span','value')[0].string
Out[26]:
'1,319.80'
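Indexing `[0]` on the result fails with `IndexError` if the page layout changes and no `span` with class `value` is found. A hedged sketch of a more defensive version follows. The helper names `fetch_html` and `first_value`, the `User-Agent` header, the timeout, and the `sample` markup are all assumptions made for illustration, not part of the original notebook or of the real page.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # Some sites reject the default urllib user agent; this header is an assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def first_value(html):
    """Return the text of the first <span class="value">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", class_="value")
    return tag.get_text(strip=True) if tag is not None else None


# Made-up markup standing in for the downloaded page, so no network is needed.
sample = '<div><span class="value">1,319.80</span><span class="value">9.50</span></div>'
found = first_value(sample)
missing = first_value("<div></div>")
print(found, missing)
```

With network access, `first_value(fetch_html("https://finance.naver.com/marketindex/"))` would apply the same lookup to the live page; the value returned depends on the page at that moment.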