Using spaCy - Vectors & Similarity

2018. 4. 23. 16:35ㆍProgramming/python

spaCy also provides a vector similarity feature.
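By default spaCy's similarity score is the cosine similarity between two vectors. Here is a minimal sketch of that calculation in plain Python; the three-dimensional vectors are invented for illustration and do not come from any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
dog_banana = cosine_similarity(toy_vectors["dog"], toy_vectors["banana"])
print("dog vs cat:", dog_cat)
print("dog vs banana:", dog_banana)
```

Vectors pointing in nearly the same direction score close to 1, and unrelated ones score close to 0.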

You can also check whether a token has a vector, what its norm is (the L2 norm here), and whether it is out of vocabulary.
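A small sketch of those checks, assuming a recent spaCy (v3) and a blank English pipeline so that no pretrained model needs to be downloaded. The vector for "apple" is a hypothetical two-dimensional one, chosen so the L2 norm is easy to verify by hand, and "qwertyzz" is just a made-up word with no vector.

```python
import numpy
import spacy

nlp = spacy.blank("en")  # tokenizer only, no pretrained vectors

# Hypothetical vector: its L2 norm is sqrt(3^2 + 4^2) = 5
nlp.vocab.set_vector("apple", numpy.asarray([3.0, 4.0], dtype="f"))

doc = nlp("apple qwertyzz")
for token in doc:
    # text, has a vector?, L2 norm of the vector, out of vocabulary?
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
```

In spaCy v3, `is_oov` should report whether the token lacks an entry in the vectors table; older versions may behave differently, so treat that column with some care.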

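Similarity between documents can be checked as well. The spaCy website says that, because things like the relationship to surrounding words are taken into account, similar scores come out even when a word is misspelled, but the results I got were not very good, so I am just skipping this part.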
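For pipelines that ship static word vectors, a document vector defaults to the average of its token vectors, and document similarity is the cosine between those averages. The sketch below imitates that in plain Python with invented toy vectors; it is not spaCy's own code. It also shows one limitation of plain averaging: word order is ignored.

```python
import math

# Toy 2-d vectors, invented for illustration only.
toy_vectors = {
    "dogs": [0.9, 0.1],
    "cats": [0.7, 0.3],
    "like": [0.2, 0.8],
    "hate": [0.1, -0.9],
}


def doc_vector(text):
    """Average the toy vectors of the whitespace-separated words."""
    rows = [toy_vectors[word] for word in text.lower().split()]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


same_words = cosine_similarity(doc_vector("dogs like cats"),
                               doc_vector("cats like dogs"))
other_verb = cosine_similarity(doc_vector("dogs like cats"),
                               doc_vector("dogs hate cats"))
print(same_words, other_verb)
```

The two sentences with swapped word order get a score of essentially 1, while changing the verb lowers it.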

You can also add word vectors of your own. (Is this actually useful...?)
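A minimal sketch of adding your own vectors with `nlp.vocab.set_vector`, again assuming a blank English pipeline. The words and values in `custom_vectors` are hypothetical; real vectors would normally have a few hundred dimensions.

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Hypothetical hand-written vectors, for illustration only.
custom_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
for word, values in custom_vectors.items():
    nlp.vocab.set_vector(word, numpy.asarray(values, dtype="f"))

doc = nlp("dog cat banana")
dog_cat = doc[0].similarity(doc[1])
dog_banana = doc[0].similarity(doc[2])
print("dog vs cat:", dog_cat)
print("dog vs banana:", dog_banana)
```

One case where this can be worthwhile is domain-specific vocabulary that the pretrained vectors do not cover.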

GloVe vectors can be added as well.
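GloVe's plain-text files contain one word per line followed by its values, with no header line, whereas the fastText loader in this post reads a "rows dimensions" header first. One possible route, sketched here under that assumption, is to prepend such a header so the same loader can be reused. The helper name `add_vec_header` and the sample lines are hypothetical.

```python
import io


def add_vec_header(glove_lines):
    """Return the lines with a 'nr_row nr_dim' header in front."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    # Every field after the word is one dimension of the vector.
    nr_dim = len(rows[0].split(" ")) - 1
    return ["{} {}".format(len(rows), nr_dim)] + rows


# Two made-up lines standing in for a real GloVe text file.
glove_text = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
converted = add_vec_header(glove_text)
print("\n".join(converted))
```

For a real file you would stream the lines to disk instead of holding them all in a list.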

Other vectors, such as fastText vectors, can be added too.

	#!/usr/bin/env python
	# coding: utf8
	"""Load vectors for a language trained using fastText
	https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
	Compatible with: spaCy v2.0.0+
	"""
	from __future__ import unicode_literals
	import plac
	import numpy

	import spacy
	from spacy.language import Language


	@plac.annotations(
	    vectors_loc=("Path to .vec file", "positional", None, str),
	    lang=("Optional language ID. If not set, blank Language() will be used.",
	          "positional", None, str))
	def main(vectors_loc, lang=None):
	    if lang is None:
	        nlp = Language()
	    else:
	        # create empty language class – this is required if you're planning to
	        # save the model to disk and load it back later (models always need a
	        # "lang" setting). Use 'xx' for blank multi-language class.
	        nlp = spacy.blank(lang)
	    with open(vectors_loc, 'rb') as file_:
	        header = file_.readline()
	        nr_row, nr_dim = header.split()
	        nlp.vocab.reset_vectors(width=int(nr_dim))
	        for line in file_:
	            line = line.rstrip().decode('utf8')
	            pieces = line.rsplit(' ', int(nr_dim))
	            word = pieces[0]
	            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
	            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
	    # test the vectors and similarity
	    text = 'class colspan'
	    doc = nlp(text)
	    print(text, doc[0].similarity(doc[1]))


	if __name__ == '__main__':
	    plac.call(main)

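The key point seems to be that each line is split on ' ', with the first element being the word and the remaining elements being the vector values, so any vectors that can be read in that form should work.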
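The parsing step can be tried on its own without spaCy. This sketch reads the same layout from an in-memory string: a "rows dimensions" header, then one entry per line. The function name `read_vec` and the sample entries are hypothetical. Splitting from the right with `rsplit` keeps an entry intact even if it contains a space itself.

```python
import io


def read_vec(file_):
    """Parse a text stream in .vec layout into (nr_dim, {word: values})."""
    nr_row, nr_dim = (int(n) for n in file_.readline().split())
    table = {}
    for line in file_:
        # Split at most nr_dim times from the right: the last nr_dim
        # fields are the values, whatever is left is the word.
        pieces = line.rstrip().rsplit(" ", nr_dim)
        table[pieces[0]] = [float(v) for v in pieces[1:]]
    return nr_dim, table


# Made-up sample data, not taken from a real fastText file.
sample = io.StringIO("2 3\nclass 0.1 0.2 0.3\nNew York 0.4 0.5 0.6\n")
nr_dim, table = read_vec(sample)
print(nr_dim, table)
```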
