Using spaCy - Training Models

2018. 4. 24. 17:19ㆍProgramming/python

spaCy's training logic is roughly as shown in the figure below.

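The training data consists of texts and labels, and the model predicts a label for each text. The prediction is compared with the correct answer, a gradient proportional to the difference is applied, and repeating this step is what updates the model.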
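The predict, compare, update cycle described above can be sketched without spaCy at all. The block below is only a toy illustration: the bag-of-words logistic model, the `TRAIN` data, and the `predict` and `train` helpers are all made up for this sketch and are not how spaCy implements its models.

```python
import math
import random

# Toy (text, label) pairs, invented for this sketch.
TRAIN = [("good great", 1), ("great fun", 1), ("bad awful", 0), ("awful boring", 0)]


def predict(weights, text):
    """Return the predicted probability that the label is 1."""
    score = sum(weights.get(word, 0.0) for word in text.split())
    return 1.0 / (1.0 + math.exp(-score))


def train(data, n_iter=50, learn_rate=0.5, seed=0):
    """Repeat: shuffle, predict, compare with the label, apply the gradient."""
    rng = random.Random(seed)
    data = list(data)
    weights = {}
    for _ in range(n_iter):
        rng.shuffle(data)
        for text, label in data:
            # Difference between prediction and truth; for the log loss this
            # is the gradient with respect to the score.
            error = predict(weights, text) - label
            for word in text.split():
                weights[word] = weights.get(word, 0.0) - learn_rate * error
    return weights


weights = train(TRAIN)
print(round(predict(weights, "good great"), 3))
print(round(predict(weights, "bad awful"), 3))
```

The larger the gap between the prediction and the label, the larger the weight change, which is the same idea as applying a gradient "proportional to the difference".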

spaCy v2 also provides a class called GoldParse, which can be used to train a model.

Entity training follows the BILUO scheme.
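In BILUO, each token is tagged as the Beginning, Inside, or Last token of a multi-token entity, as a single-token Unit entity, or as Outside any entity. As a rough sketch of the idea, the block below converts character offsets into BILUO tags. The `tokenize` and `biluo_tags` helpers are hypothetical, the regex tokenizer is only a crude stand-in for spaCy's tokenizer, and entities that do not line up with token boundaries are simply not handled here.

```python
import re


def tokenize(text):
    """Return (start, end) character spans of words and punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_tags(text, entities):
    """Map (start, end, label) character offsets to one BILUO tag per token."""
    spans = tokenize(text)
    tags = ["O"] * len(spans)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie entirely inside the entity span.
        inside = [i for i, (s, e) in enumerate(spans)
                  if s >= ent_start and e <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            tags[inside[-1]] = "L-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
    return tags


print(biluo_tags("Who is Shaka Khan?", [(7, 17, "PERSON")]))
# ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
print(biluo_tags("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]))
# ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']
```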

Dropout can also be applied during training, so that the model generalizes better instead of memorizing the training data.
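To make the idea concrete, here is a generic sketch of "inverted" dropout on a list of activations. It is an illustration of the technique in general, not spaCy's internal code, and the `dropout` helper and its parameters are assumptions made for this example.

```python
import random


def dropout(values, drop, rng):
    """Zero each value with probability `drop`; rescale the survivors.

    Dividing the kept values by (1 - drop) keeps the expected sum unchanged,
    so nothing needs to be rescaled at prediction time.
    """
    keep = 1.0 - drop
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, 0.5, rng)
print(sum(1 for v in dropped if v == 0.0), "of", len(dropped), "values were zeroed")
```

With `drop=0.5`, about half of the values are zeroed on each call, which makes it harder for the model to rely on any single feature.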

The following walks through simple code for updating a model.

It creates a blank English model, shuffles the training data, and then takes the examples one at a time to update the model.

Below is the code that updates an NER model.

	#!/usr/bin/env python
	# coding: utf8
	"""Example of training spaCy's named entity recognizer, starting off with an
	existing model or a blank model.

	For more details, see the documentation:
	* Training: https://spacy.io/usage/training
	* NER: https://spacy.io/usage/linguistic-features#named-entities

	Compatible with: spaCy v2.0.0+
	"""
	from __future__ import unicode_literals, print_function

	import plac
	import random
	from pathlib import Path
	import spacy


	# training data
	TRAIN_DATA = [
	    ('Who is Shaka Khan?', {
	        'entities': [(7, 17, 'PERSON')]
	    }),
	    ('I like London and Berlin.', {
	        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
	    })
	]


	@plac.annotations(
	    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
	    output_dir=("Optional output directory", "option", "o", Path),
	    n_iter=("Number of training iterations", "option", "n", int))
	def main(model=None, output_dir=None, n_iter=100):
	    """Load the model, set up the pipeline and train the entity recognizer."""
	    if model is not None:
	        nlp = spacy.load(model)  # load existing spaCy model
	        print("Loaded model '%s'" % model)
	    else:
	        nlp = spacy.blank('en')  # create blank Language class
	        print("Created blank 'en' model")

	    # create the built-in pipeline components and add them to the pipeline
	    # nlp.create_pipe works for built-ins that are registered with spaCy
	    if 'ner' not in nlp.pipe_names:
	        ner = nlp.create_pipe('ner')
	        nlp.add_pipe(ner, last=True)
	    # otherwise, get it so we can add labels
	    else:
	        ner = nlp.get_pipe('ner')

	    # add labels
	    for _, annotations in TRAIN_DATA:
	        for ent in annotations.get('entities'):
	            ner.add_label(ent[2])

	    # get names of other pipes to disable them during training
	    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
	    with nlp.disable_pipes(*other_pipes):  # only train NER
	        optimizer = nlp.begin_training()
	        for itn in range(n_iter):
	            random.shuffle(TRAIN_DATA)
	            losses = {}
	            for text, annotations in TRAIN_DATA:
	                nlp.update(
	                    [text],  # batch of texts
	                    [annotations],  # batch of annotations
	                    drop=0.5,  # dropout - make it harder to memorise data
	                    sgd=optimizer,  # callable to update weights
	                    losses=losses)
	            print(losses)

	    # test the trained model
	    for text, _ in TRAIN_DATA:
	        doc = nlp(text)
	        print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
	        print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

	    # save model to output directory
	    if output_dir is not None:
	        output_dir = Path(output_dir)
	        if not output_dir.exists():
	            output_dir.mkdir()
	        nlp.to_disk(output_dir)
	        print("Saved model to", output_dir)

	        # test the saved model
	        print("Loading from", output_dir)
	        nlp2 = spacy.load(output_dir)
	        for text, _ in TRAIN_DATA:
	            doc = nlp2(text)
	            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
	            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])


	if __name__ == '__main__':
	    plac.call(main)

	# Expected output:
	# Entities [('Shaka Khan', 'PERSON')]
	# Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
	# ('Khan', 'PERSON', 1), ('?', '', 2)]
	# Entities [('London', 'LOC'), ('Berlin', 'LOC')]
	# Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
	# ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]


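First the model is loaded, and then the ner pipe is looked up in the model's pipeline. (If it does not exist, it is created.)

Next, the entity labels that appear in the training data are added to the ner pipe.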
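Because the labels come straight from the annotations, it can help to inspect them before training. The sketch below uses the same `TRAIN_DATA` format as the script; the `collect_labels` and `entity_texts` helpers are hypothetical names introduced only for this example. It lists the distinct labels and slices each text by its character offsets, which is a quick way to catch offsets that are off by one.

```python
# Same annotation format as the training script: (text, {'entities': [...]}).
TRAIN_DATA = [
    ('Who is Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),
    ('I like London and Berlin.', {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
]


def collect_labels(train_data):
    """Return the sorted set of entity labels used in the annotations."""
    return sorted({label
                   for _, annotations in train_data
                   for _, _, label in annotations.get('entities', [])})


def entity_texts(train_data):
    """Return (substring, label) pairs so the offsets can be checked by eye."""
    return [(text[start:end], label)
            for text, annotations in train_data
            for start, end, label in annotations.get('entities', [])]


print(collect_labels(TRAIN_DATA))  # ['LOC', 'PERSON']
print(entity_texts(TRAIN_DATA))
# [('Shaka Khan', 'PERSON'), ('London', 'LOC'), ('Berlin', 'LOC')]
```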

Then the model update itself is carried out. One notable point is that every pipe other than ner is disabled during training, so that only the NER component is trained.

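Once training is finished, the model is checked on the training texts and saved to disk; the saved model is then loaded again and checked on the same texts.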
