Our DataCamp study group has moved on to a new topic: NLP. In the weeks I didn't post, we mainly covered Feature Engineering, Preprocessing, and Model Validation. This course was my first exposure to NLP; it was mostly about how to preprocess text, and we tried out several NLP libraries along the way.
1. Regular expressions & word tokenization
- Regular expressions (regex) -> import re; re.search(), re.match(), re.findall() (a short sketch follows this outline)
- Introduction to tokenization
- Word tokenization with NLTK
2. Simple topic identification
- Word counts with bag-of-words
- Building a Counter with bag-of-words
- Introduction to gensim (maps text to word vectors)
-> why?
Word embedding: one-hot encoding cannot capture word meaning or the similarity between words,
so we use word vectors that do reflect semantic similarity (a small comparison follows this outline)
- Word vectors
- Creating and querying a corpus with gensim
- Tf-idf with gensim
3. Named-entity recognition
- Named Entity Recognition
- Introduction to SpaCy (a minimal NER sketch follows this outline)
- Multilingual NER with polyglot
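A minimal sketch of the three re functions from topic 1 (the sample sentence is made up for illustration):

import re

text = "WayV debuted on January 17, 2019."

# re.match only tries to match at the very beginning of the string
print(re.match(r"\w+", text))       # <re.Match ...> matching 'WayV'
# re.search scans the whole string and returns the first match
print(re.search(r"\d{4}", text))    # <re.Match ...> matching '2019'
# re.findall returns every non-overlapping match as a list
print(re.findall(r"\d+", text))     # ['17', '2019']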
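And a small numeric illustration of the word-embedding point under topic 2: with one-hot encoding every pair of distinct words is equally unrelated, while dense word vectors can place related words close together (the dense values below are made up for illustration, not taken from a trained model):

import numpy as np

# one-hot vectors: any two different words have dot product 0,
# so there is no notion of 'cat' being closer to 'dog' than to 'chart'
cat_onehot = np.array([1, 0, 0, 0])
dog_onehot = np.array([0, 1, 0, 0])
print(np.dot(cat_onehot, dog_onehot))   # 0

# dense word vectors (illustrative values) can encode similarity
cat_vec = np.array([0.8, 0.1, 0.3])
dog_vec = np.array([0.7, 0.2, 0.4])
cos = np.dot(cat_vec, dog_vec) / (np.linalg.norm(cat_vec) * np.linalg.norm(dog_vec))
print(round(cos, 2))                    # ~0.98 -> 'cat' and 'dog' are similar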
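For topic 3, a minimal spaCy NER sketch; it assumes the small English model en_core_web_sm has already been downloaded (python -m spacy download en_core_web_sm):

import spacy

# load the pretrained English pipeline and run NER on a short sentence
nlp = spacy.load('en_core_web_sm')
doc = nlp('WayV is a boy group managed by Label V, debuting in January 2019 in China.')
for ent in doc.ents:
    print(ent.text, ent.label_)   # entity text with labels such as ORG, DATE, GPE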
<Practice 1>
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import re
from nltk.tokenize import regexp_tokenize
# Topic: Preprocessing before NLP using various modules
# Bag of words: find topic based on frequency.
# data from wiki document about wayv :) https://en.wikipedia.org/wiki/WayV
wiki = ''
with open('/content/data.txt', 'r') as data:
    for line in data:
        wiki += line
# split the document into sentences
token1 = sent_tokenize(wiki)
# alternative: word_tokenize each sentence (kept for reference)
# for i, sent in enumerate(token1):
#     word_tokens.append(word_tokenize(token1[i]))
# tokenize each sentence into words, dropping punctuation
word_tokens = []
# regex that keeps alphanumeric tokens of length >= 2,
# so punctuation (and one-character words) is dropped
pattern = r"[\d+|\w+]\w+"
for sent in token1:
    regex_token = regexp_tokenize(sent, pattern)
    word_tokens.append(regex_token)
# word_tokens is a 2-D list: one list of word tokens per sentence
# bag-of-words: lowercase every token and count frequencies
lower_tokens = []
for sent in word_tokens:
    for word in sent:
        lower_tokens.append(word.lower())
bow_simple = Counter(lower_tokens)
print(bow_simple)
print(bow_simple.most_common(10))
# [('the', 104),
# ('in', 39),
# ('on', 38),
# ('of', 37),
# ('and', 32),
# ('nct', 27),
# ('was', 19),
# ('wayv', 19),
# ('group', 19),
# ('chart', 18)]
<Practice 2>
# Dictionary: a mapping between words and their integer ids
from gensim.corpora.dictionary import Dictionary

# the input of Dictionary is a 2-D list: one token list per document;
# reuse word_tokens from Practice 1, lowercased
tokens = [[word.lower() for word in sent] for sent in word_tokens]
dictionary = Dictionary(tokens)
wayv_id = dictionary.token2id.get('wayv')
print(wayv_id)
# 88
print(dictionary.get(wayv_id))
# wayv
# build the corpus (corpus: a collection of documents) as bag-of-words vectors
# tokens were already preprocessed: lowercased, tokenized, punctuation removed
corpus = [dictionary.doc2bow(doc) for doc in tokens]
print(corpus[4][:5])
# [(7, 4), (12, 1), (15, 2), (17, 2), (33, 1)]
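Tf-idf with gensim is in the outline but not in the practice above; a short sketch of how it could continue from the dictionary and corpus built here, using gensim's TfidfModel:

from gensim.models import TfidfModel

# fit a tf-idf model on the bag-of-words corpus
tfidf = TfidfModel(corpus)
# weight the same document as above; each entry is (token id, tf-idf weight)
tfidf_weights = tfidf[corpus[4]]
print(tfidf_weights[:5])
# the highest-weighted terms are the most characteristic words of that document
for term_id, weight in sorted(tfidf_weights, key=lambda w: w[1], reverse=True)[:5]:
    print(dictionary.get(term_id), weight)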