This was the second NLP-related lecture, following on from WEEK 12.
1. Basic features and readability scores
- Readability test = how easily a piece of text can be read
=> Flesch reading ease, Gunning fog index, Simple Measure of Gobbledygook (SMOG), Dale-Chall score
- Flesch reading ease
=> Assumptions: 1. the longer the sentences and 2. the higher the average number of syllables per word, the harder the text is to read (see the sketch below)
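A minimal sketch of how those two assumptions enter the Flesch reading ease formula. The syllable counter here is a rough vowel-group heuristic I made up for illustration, not the exact counter used by readability libraries, so the score will only approximate library output.

```python
import re

def count_syllables(word):
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                           # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)   # avg syllables per word
    # Standard Flesch formula: longer sentences and more syllables lower the score,
    # i.e. a lower score means the text is harder to read.
    return 206.835 - 1.015 * asl - 84.6 * asw

print(flesch_reading_ease("The cat sat on the mat. It was happy."))
```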
2. Text preprocessing, POS tagging and NER
- Tokenization and Lemmatization
- Part-of-speech tagging (POS tagging)
- Named entity recognition (spaCy sketch below)
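The notes above don't name a library for these steps, so here is one common way to do all three with spaCy, assuming the en_core_web_sm model is installed (pip install spacy, then python -m spacy download en_core_web_sm).

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenization + lemmatization + POS tagging in one pass over the tokens.
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entity recognition: each entity with its label (ORG, GPE, MONEY, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```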
3. N-gram models
- Building a bag of words model
BoW = represent each document by its raw token frequencies
- Building a BoW Naive Bayes classifier
- Building n-gram models (combined sketch after this list)
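A minimal scikit-learn sketch covering the three bullets above: a bag-of-words / n-gram representation fed into a Naive Bayes classifier. The toy corpus and sentiment labels are invented placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus = ["great fun movie", "fun and great acting", "boring slow plot", "slow and boring"]
labels = [1, 1, 0, 0]  # toy labels: 1 = positive, 0 = negative

# ngram_range=(1, 2) keeps unigrams and bigrams; (1, 1) would be a plain BoW model.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["great plot and acting"])))
```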
4. TF-IDF and similarity scores
- Building tf-idf document vectors
- How to calculate the tf-idf value? => tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many of them contain term t; frequent-everywhere terms get weighted down
- Cosine similarity = ranges from 0 to 1 for tf-idf vectors; 1 = the two documents point in the same direction (maximally similar)
- Building a plot line based recommender (sketch below)
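A sketch of a plot-line recommender using tf-idf vectors and cosine similarity with scikit-learn. The three plot summaries are made-up placeholders standing in for a real movie-plot dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plots = [
    "A young wizard attends a school of magic and fights a dark lord.",
    "A hobbit journeys across a fantasy land to destroy a powerful ring.",
    "A detective investigates a series of murders in a rainy city.",
]

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(plots)

# Pairwise cosine similarities between all plot vectors (values in [0, 1]).
sim = cosine_similarity(tfidf_matrix)

# Recommend: for plot 0, pick the most similar other plot.
query = 0
scores = [(i, s) for i, s in enumerate(sim[query]) if i != query]
best = max(scores, key=lambda pair: pair[1])
print(f"Most similar to plot {query}: plot {best[0]} (score {best[1]:.2f})")
```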