
[WEEK12] Introduction to Natural Language Processing in Python

by gojw 2021. 11. 29.

Our DataCamp study group has moved on to a new topic: NLP. In the weeks I didn't post, we mainly covered Feature Engineering, Preprocessing, and Model Validation. NLP was completely new to me before this course; it was mostly about how to preprocess text, and along the way we imported and tried out several NLP libraries.

 

1. Regular expressions & word tokenization

- Regular expressions (regex) -> import re; re.search(), re.match(), re.findall() (see the sketch after this list)

- Introduction to tokenization

- Word tokenization with NLTK
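
To recap this chapter, here is a minimal sketch combining re and NLTK tokenization (my own toy sentence, not a course exercise):

import re
import nltk
nltk.download('punkt')  # tokenizer models used by sent_tokenize / word_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize

text = "WayV debuted in January 2019. The group has seven members!"

# re.match() only matches at the start of the string; re.search() scans the whole string
print(re.match(r"\w+", text).group())     # 'WayV'
print(re.search(r"\d{4}", text).group())  # '2019'
print(re.findall(r"[A-Z]\w+", text))      # every capitalized word

# NLTK: split into sentences first, then into word tokens
print(sent_tokenize(text))
print(word_tokenize(text))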

 

2. Simple topic identification

- Word counts with bag-of-words

- Building a Counter with bag-of-words

- Introduction to gensim (map text to word vector)

-> why?

Word embedding: one-hot encoding cannot capture the meaning of words or the similarity between them,

so we use word vectors that reflect semantic similarity instead (see the sketch after this list).

- Word vectors

- Creating and querying a corpus with gensim

- Tf-idf with gensim
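
As a reference for the word-vector idea above, a minimal sketch using gensim's downloader API (not a course exercise; it fetches the small glove-wiki-gigaword-50 vectors on first run):

import gensim.downloader as api

# load small pretrained GloVe word vectors (downloaded on first use)
wv = api.load("glove-wiki-gigaword-50")

# unlike one-hot encoding, word vectors let us measure semantic similarity
print(wv.similarity("music", "song"))    # cosine similarity between the two vectors
print(wv.most_similar("music", topn=3))  # nearest words in the vector space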

 

3. Named-entity recognition

- Named Entity Recognition

- Introduction to spaCy (see the sketch after this list)

- Multilingual NER with polyglot
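
A minimal spaCy NER sketch for the items above (not a course exercise; it assumes the small English model was installed with python -m spacy download en_core_web_sm):

import spacy

# load the small English pipeline (assumed to be installed beforehand)
nlp = spacy.load("en_core_web_sm")

doc = nlp("WayV debuted in China in 2019 under Label V.")

# each detected entity has a text span and a label such as PERSON, ORG, GPE, DATE
for ent in doc.ents:
  print(ent.text, ent.label_)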

 

<Practice 1>

import re
import nltk
nltk.download('punkt')  # tokenizer models used by sent_tokenize / word_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize
from collections import Counter

# Topic: preprocessing text with various modules before doing NLP
# Bag of words: identify the topic of a document from word frequencies.

# data from wiki document about wayv :) https://en.wikipedia.org/wiki/WayV
with open('/content/data.txt', 'r') as data:
  wiki = data.read()

token1 = sent_tokenize(wiki)

# for loop -> tokenize each sentence
# for i, sent in enumerate(token1):
  # word_tokens.append(word_tokenize(token1[i]))

# remove punctuation while tokenizing each sentence into words
word_tokens = []
# the pattern keeps alphanumeric tokens of length >= 2, so punctuation (and one-letter words) is dropped
pattern = r"[\d+|\w+]\w+"
for sent in token1:
  regex_token = regexp_tokenize(sent, pattern)
  word_tokens.append(regex_token)

# word_tokens: 2-D list with one list of word tokens per sentence

# Bag of words method
lower_tokens = []

for element in word_tokens:
  for i in element:
    lower_tokens.append(i.lower())

bow_simple = Counter(lower_tokens)
print(bow_simple)

print(bow_simple.most_common(10))
# [('the', 104),
#  ('in', 39),
#  ('on', 38),
#  ('of', 37),
#  ('and', 32),
#  ('nct', 27),
#  ('was', 19),
#  ('wayv', 19),
#  ('group', 19),
#  ('chart', 18)]
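
The top counts are dominated by stopwords ('the', 'in', 'on', ...). Not part of the original practice, but here is a minimal sketch of filtering them out with NLTK's stopword list before counting, reusing lower_tokens from above:

nltk.download('stopwords')
from nltk.corpus import stopwords

# drop common English stopwords before building the bag of words
stops = set(stopwords.words('english'))
no_stop_tokens = [t for t in lower_tokens if t not in stops]

bow_no_stops = Counter(no_stop_tokens)
print(bow_no_stops.most_common(10))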

 

<Practice 2>

# Dictionary: a mapping between words and their integer ids
from gensim.corpora.dictionary import Dictionary

# the input of Dictionary is a 2-D list (one token list per document/sentence);
# here tokens is built by lowercasing word_tokens from Practice 1
tokens = [[t.lower() for t in sent] for sent in word_tokens]
dictionary = Dictionary(tokens)

wayv_id = dictionary.token2id.get('wayv')
print(wayv_id)
# 88

print(dictionary.get(wayv_id))
# wayv

# create a gensim bag-of-words corpus (corpus = a collection of documents):
# doc2bow turns each document into a list of (token_id, count) tuples
# tokens were already preprocessed by lowercasing, tokenizing, and removing punctuation
corpus = [dictionary.doc2bow(token) for token in tokens]
print(corpus[4][:5])
# [(7, 4), (12, 1), (15, 2), (17, 2), (33, 1)]
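
To cover the "Tf-idf with gensim" part as well, a minimal sketch reusing the corpus built above (not part of the original practice code):

from gensim.models.tfidfmodel import TfidfModel

# fit a tf-idf model on the bag-of-words corpus
tfidf = TfidfModel(corpus)

# reweight the 5th document: words that appear in many documents get low weights,
# document-specific words get high weights
tfidf_weights = tfidf[corpus[4]]
print(tfidf_weights[:5])  # list of (token_id, tf-idf weight) tuples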
