[WEEK13] Feature Engineering for NLP in Python

by gojw 2021. 12. 6.

Following WEEK 12, this was the second NLP-related course.

 

1. Basic features and readability scores

- Readability test = how easy is a text to read?

=> Flesch reading ease, Gunning fog index, Simple Measure of Gobbledygook (SMOG), Dale-Chall score

- Flesch reading ease

=> Assumptions: a text is harder to read 1. the longer its sentences are, and 2. the larger its average number of syllables per word
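
A minimal sketch of the Flesch reading ease score, to make the two assumptions above concrete. The formula is the standard one (206.835 - 1.015 × average words per sentence - 84.6 × average syllables per word). The syllable counter is a rough vowel-group heuristic of my own, not what a readability library would use, and the two sample texts are made up.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent "e" usually adds no syllable ("make"), except in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """Higher score = easier text. Penalizes long sentences and many-syllable words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# Made-up sample texts for illustration.
easy = "The cat sat on the mat. It was a good day."
hard = ("Comprehensive institutional evaluations necessitate "
        "considerable interdisciplinary collaboration.")

print(flesch_reading_ease(easy) > flesch_reading_ease(hard))
```

Short sentences with one-syllable words score higher than a single long sentence of polysyllabic words, which is the behavior the two assumptions describe.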

 

2. Text preprocessing, POS tagging and NER

- Tokenization and Lemmatization

- Part-of-speech tagging (POS tagging)

- Named entity recognition
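
A toy sketch of the first item only (tokenization and lemmatization), using just the standard library. The lemma table is a hypothetical hand-written lookup for illustration; real lemmatizers use full dictionaries and part-of-speech information. POS tagging and NER depend on trained models, so they are not sketched here.

```python
import re
from typing import List

# Hypothetical hand-written lemma table, for illustration only.
TOY_LEMMAS = {
    "was": "be", "were": "be", "is": "be", "are": "be",
    "ran": "run", "running": "run",
    "cats": "cat", "dogs": "dog", "mice": "mouse",
}

def tokenize(text: str) -> List[str]:
    """Split text into word tokens and standalone punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def lemmatize(tokens: List[str]) -> List[str]:
    """Lowercase each token and map it to a base form if the toy table knows it."""
    return [TOY_LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

tokens = tokenize("The cats were running, and the mice ran.")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

Tokenization keeps punctuation as separate tokens; lemmatization then collapses inflected forms ("were", "running", "mice") onto one base form each, which shrinks the vocabulary that later feature steps have to handle.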

 

3. N-gram models

- Building a bag of words model

BoW = each document is represented by its word frequencies; word order is ignored

- Building a BoW Naive Bayes classifier

- Building n-gram models
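
A small standard-library sketch of the bag-of-words and n-gram ideas above. The function names and the two-document corpus are made up for illustration. Here `n` selects exactly one n-gram size; a vectorizer such as scikit-learn's `CountVectorizer` can instead combine a range of sizes. The Naive Bayes classifier step is not covered.

```python
from collections import Counter
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs: List[str], n: int = 1) -> Tuple[List[str], List[List[int]]]:
    """Build a sorted vocabulary of n-grams and one count vector per document."""
    counts = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

# Made-up corpus for illustration.
corpus = ["the cat sat", "the cat saw the dog"]

vocab, vectors = bag_of_words(corpus)                      # unigrams
bigram_vocab, bigram_vectors = bag_of_words(corpus, n=2)   # bigrams

print(vocab, vectors)
print(bigram_vocab, bigram_vectors)
```

The unigram vectors only record how often each word occurs, while the bigram vectors keep some local word order ("cat sat" vs. "cat saw") at the cost of a larger, sparser vocabulary.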

 

4. TF-IDF and similarity scores

- Building tf-idf document vectors

- How to calculate tf-idf value?

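- Cosine similarity = between 0 and 1 for tf-idf vectors (-1 to 1 in general); 1 = vectors point in the same direction (identical documents)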

- Building a plot line based recommender
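
A sketch tying the four items above together with the standard library only. It assumes the plain weighting tf × log(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ. The movie titles and plot lines are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def tfidf_vectors(docs: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Weight each term by tf * log(N / df) (unsmoothed, an assumption)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def recommend(title: str, plots: Dict[str, str]) -> List[str]:
    """Rank the other titles by cosine similarity of their plot tf-idf vectors."""
    titles = list(plots)
    _, vectors = tfidf_vectors([plots[t] for t in titles])
    query = vectors[titles.index(title)]
    scored = [(cosine_similarity(query, v), t)
              for t, v in zip(titles, vectors) if t != title]
    return [t for _, t in sorted(scored, reverse=True)]

# Invented titles and plot lines for illustration.
plots = {
    "Space Quest": "astronauts explore a distant planet",
    "Star Voyage": "astronauts travel to a distant star",
    "Farm Life": "a farmer raises chickens and cows",
}

print(recommend("Space Quest", plots))
```

The word "a" appears in every plot, so its idf is log(3/3) = 0 and it contributes nothing to the similarity; the two space plots are ranked together because they share the rarer terms "astronauts" and "distant".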

 
