[Python] Using Beautiful Soup

2015. 2. 17. 11:36ㆍProgramming/python

Installing Beautiful Soup (Ubuntu)

$> pip install beautifulsoup4

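Fetching a web page with a GET request

from bs4 import BeautifulSoup
import urllib2  # Python 2; in Python 3 this module became urllib.request

page = None
try:
    response = urllib2.urlopen("URL of the page to fetch")
    page = response.read().decode('cp949', 'ignore')  # only if the encoding needs converting
    response.close()
except urllib2.HTTPError as e:
    # The server answered with an error status (404, 500, ...)
    print(e.code)
    print(e.msg)
except urllib2.URLError as e:
    # The server could not be reached at all
    print(e.reason)

# Parse only if the fetch succeeded; otherwise page is still None
if page is not None:
    soup = BeautifulSoup(page, 'html.parser')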
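The `decode('cp949', 'ignore')` call above turns the raw bytes of a Korean-encoded page into text and silently drops any bytes that are not valid CP949. A minimal sketch of that behavior, written for Python 3 (where `response.read()` also returns bytes); the helper name `decode_page` and the sample bytes are made up for illustration:

```python
def decode_page(raw, encoding="cp949"):
    """Decode fetched bytes, dropping byte sequences that are invalid in `encoding`."""
    return raw.decode(encoding, "ignore")

# Hypothetical sample: Korean text encoded as CP949, followed by one stray byte.
raw = "파이썬".encode("cp949") + b"\xff"

text = decode_page(raw)
print(text)
```

Because `'ignore'` discards bad bytes without any warning, it is probably best reserved for pages known to contain a few broken characters.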

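Fetching a web page with a GET request (with headers)

from bs4 import BeautifulSoup
import urllib2  # Python 2; in Python 3 this module became urllib.request

try:
    req = urllib2.Request("URL of the page to fetch")
    # Some sites reject requests that do not look like they come from a browser
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36')
    req.add_header('Host', "Host address")
    req.add_header('Origin', "Origin address")
    req.add_header('Referer', "Referer address")
    response = urllib2.urlopen(req)
    page = response.read()
    response.close()
    soup = BeautifulSoup(page, 'html.parser')
except urllib2.HTTPError as e:
    # The server answered with an error status (404, 500, ...)
    print(e.code)
    print(e.msg)
except urllib2.URLError as e:
    # The server could not be reached at all
    print(e.reason)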
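To check which headers will be sent before making any network call, the request object can be built and inspected on its own. A small sketch using Python 3's `urllib.request` (the counterpart of `urllib2`); the helper `build_request` and the example.com addresses are placeholders, not values from the original post:

```python
from urllib.request import Request

def build_request(url, referer):
    """Build (but do not send) a GET request carrying browser-like headers."""
    req = Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1)")
    req.add_header("Referer", referer)
    return req

req = build_request("http://example.com/list", "http://example.com/")

# urllib stores header names in capitalized form ('User-agent'),
# so that is the spelling to use when looking them up.
print(req.get_method())
print(req.get_header("User-agent"))
print(req.header_items())
```

Nothing is transmitted until the request is passed to `urlopen`, so this is a cheap way to debug header setup.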

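Fetching a web page with a POST request

from bs4 import BeautifulSoup
import urllib
import urllib2  # Python 2; in Python 3 these became urllib.parse and urllib.request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'POST data key': 'POST data value'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)  # form-encode the POST body

page = None
try:
    # Passing data makes urllib2 send a POST instead of a GET
    request = urllib2.Request("URL of the page to fetch", data, headers)
    response = urllib2.urlopen(request)
    page = response.read().decode('cp949', 'ignore')  # only if the encoding needs converting
    response.close()
except urllib2.HTTPError as e:
    print(e.code)
    print(e.msg)
except urllib2.URLError as e:
    print(e.reason)

# Parse only if the fetch succeeded; otherwise page is still None
if page is not None:
    soup = BeautifulSoup(page, 'html.parser')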
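The POST body is just the form fields in URL-encoded form. A short sketch of that encoding step in Python 3, where the encoded string must also be converted to bytes before it is attached to the request; the field names and the example.com URL are hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import Request

values = {"query": "beautiful soup", "page": "1"}  # hypothetical form fields
data = urlencode(values).encode("ascii")           # Python 3 requires bytes here

# Attaching data switches the request method from GET to POST.
req = Request("http://example.com/search", data=data,
              headers={"User-Agent": "Mozilla/5.0"})

print(data)
print(req.get_method())
```

The request is only constructed here, not sent; passing `req` to `urlopen` would perform the actual POST.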

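Parsing the fetched DOM with BeautifulSoup

# Get the td tags whose class is 'test' and print them
elements = soup.find_all('td', {'class': 'test'})
for i in elements:
    print(str(i))

# Get the img tags whose 'vspace' attribute equals 5 and print them
elements = soup.find_all('img', attrs={'vspace': '5'})
for i in elements:
    print(str(i))

# Get the img tags and print the contents of their onclick attribute
elements = soup.find_all('img')
for i in elements:
    # i['onclick'] raises KeyError for tags without the attribute; get() returns None instead
    onclick = i.get('onclick')
    if onclick is not None:
        print(str(onclick))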
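To try these searches without fetching anything, a small inline document is enough. The following is a sketch only: the HTML snippet is invented for illustration, and it uses the parser bundled with Python (`html.parser`) so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page.
html = """
<table><tr>
  <td class="test">first</td>
  <td class="other">second</td>
  <td class="test">third</td>
</tr></table>
<img src="a.png" vspace="5" onclick="show(1)">
<img src="b.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Text of every td whose class is 'test'
cells = [td.get_text() for td in soup.find_all("td", {"class": "test"})]
# src of every img whose vspace attribute is '5'
spaced = [img["src"] for img in soup.find_all("img", attrs={"vspace": "5"})]
# onclick of every img; get() yields None where the attribute is missing
handlers = [img.get("onclick") for img in soup.find_all("img")]

print(cells, spaced, handlers)
```

Using `get()` rather than subscripting keeps the loop from failing on the second `img`, which has no `onclick` attribute.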

# Get the parent node of each img tag
elements = soup.find_all('img')
for i in elements:
    parent = i.parent
    print(str(parent))

# Get the previous sibling of each img tag (the node just before it at the same level)
for i in elements:
    sibling = i.previous_sibling
    print(str(sibling))
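One thing to watch for when moving between siblings: the node immediately before a tag is often a whitespace text node rather than another tag. A brief sketch on an invented snippet, comparing `previous_sibling` with `find_previous_sibling()`, which skips text nodes:

```python
from bs4 import BeautifulSoup

# Made-up markup: a span and an img separated by a newline.
html = '<div id="box"><span>caption</span>\n<img src="a.png"></div>'
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

print(img.parent["id"])              # id of the enclosing div
print(repr(img.previous_sibling))    # the text node between the two tags
print(img.find_previous_sibling())   # the nearest preceding tag
```

If only tags are of interest, `find_previous_sibling()` is usually the more convenient of the two.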

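Parsing XML with BeautifulSoup

- XML is parsed in the same way as the DOM structure above.

For example, if each test element in the XML carries value1 and value2 attributes, they can be read as follows.

# Fetching the XML data works the same way as before.
tests = soup.find_all('test')
for test in tests:
    print(test['value1'])
    print(test['value2'])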
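The loop above can be tried on a small inline document. The XML below is a hypothetical sample, assumed only to have the `test` elements and `value1`/`value2` attributes that the code refers to; it is not the file from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical XML with the element and attribute names used above.
xml = """
<tests>
  <test value1="a" value2="1"></test>
  <test value1="b" value2="2"></test>
</tests>
"""
soup = BeautifulSoup(xml, "html.parser")

# Collect the two attributes of every test element.
pairs = [(t["value1"], t["value2"]) for t in soup.find_all("test")]
print(pairs)
```

This sketch uses `html.parser`, which lowercases tag and attribute names. For case-sensitive XML, Beautiful Soup's `"xml"` parser is the better fit, but it requires the lxml package to be installed.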

For a more detailed explanation, see the BeautifulSoup homepage.

Ju Factory