[WEEK1] Supervised Learning with scikit-learn

The first week of the study group is over. We covered the overall picture of supervised learning and how to use sklearn.

1. Classification: solving classification problems with supervised learning techniques.

- Supervised Learning = learning from data that is already labeled

- Numerical / Visual EDA = pandas, seaborn

- k-Nearest Neighbors algorithm = predicts the label of a point by majority vote among its k closest labeled points, which sets a decision boundary

- Measuring model performance

- Train/test split = split the data into one part for training the model and one part for testing it

- Overfitting and underfitting (bias and variance)

=> in kNN, a small k leads to overfitting and a large k leads to underfitting
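
The effect of k can be checked with a small experiment. This is only a sketch: the iris dataset, the 70/30 split, and the three k values are my own choices for illustration, not something from the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a convenient built-in labeled dataset.
X, y = load_iris(return_X_y=True)

# Keep 30% of the data aside for testing; stratify keeps class ratios equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for k in (1, 15, 105):  # very small, moderate, and "all training points"
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # score() returns accuracy; compare training vs. test accuracy per k.
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k}: train={scores[k][0]:.3f}, test={scores[k][1]:.3f}")
```

With k=1 the model memorizes the training set (the overfitting side), while a k as large as the training set ignores the features and falls back to a single class (the underfitting side).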

2. Regression

- What is regression? The target variable is continuous.

- Linear regression y = a x + b

- Cross-validation = done inside the training set. k-fold: split it into k folds, hold out each fold in turn for validation, and train on the rest

- Regularized regression = penalizing large coefficients. Ridge, Lasso regression

This addresses the high-variance problem (overfitting).
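
A minimal sketch of the regression items above, comparing plain linear regression with Ridge and Lasso under 5-fold cross-validation. The diabetes dataset and alpha=0.1 are assumptions picked for illustration; in practice alpha would be tuned.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Assumption: the built-in diabetes dataset stands in for any continuous target.
X, y = load_diabetes(return_X_y=True)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),   # L2 penalty on the coefficients
    "lasso": Lasso(alpha=0.1),   # L1 penalty, can push coefficients to zero
}

cv_r2 = {}
for name, model in models.items():
    # For regressors, cross_val_score reports R^2 on each of the 5 folds.
    fold_scores = cross_val_score(model, X, y, cv=5)
    cv_r2[name] = fold_scores.mean()
    print(f"{name}: mean R^2 = {cv_r2[name]:.3f}")

# Regularization shrinks the coefficient vector compared with plain least squares.
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
ridge_norm = np.linalg.norm(Ridge(alpha=0.1).fit(X, y).coef_)
print(f"coefficient norm: OLS={ols_norm:.1f}, Ridge={ridge_norm:.1f}")
```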

3. Fine-tuning your model

- How good is your model?

- Metrics for classification = confusion matrix (counts of TP, FP, TN, FN)

- Logistic regression and the ROC curve = true positive rate plotted against false positive rate as the decision threshold varies

+ PR curve = precision and recall

- Area under the ROC curve = AUC, AUC computation

if AUC > 0.5 => the model is better than random guessing

- Hyperparameter tuning = with GridSearchCV (computationally expensive), RandomizedSearchCV

Hyperparameters cannot be learned by fitting the model; they have to be set before fitting.

- Hold-out set

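What is a hold-out set? Split the dataset into a training set and a test set, and keep the test set aside for the final evaluation only.

Cross-validation = split the training set into k groups (folds).

CV is usually preferred because the model is trained and evaluated on multiple train/validation splits, so the score depends less on one particular split.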
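
The pieces of this section fit together roughly as follows: tune a hyperparameter with GridSearchCV on the training set only, then evaluate once on the hold-out set with a confusion matrix and AUC. This is a sketch under my own assumptions: the data is synthetic (make_classification) and the grid of C values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The hold-out set is never seen during tuning.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# C is a hyperparameter: it is chosen by the search, not learned by fit().
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="roc_auc"
)
search.fit(X_train, y_train)  # 5-fold CV inside the training set

y_pred = search.predict(X_hold)               # hard labels for the confusion matrix
y_prob = search.predict_proba(X_hold)[:, 1]   # probabilities for the ROC curve / AUC

cm = confusion_matrix(y_hold, y_pred)
auc = roc_auc_score(y_hold, y_prob)
print("best C:", search.best_params_["C"])
print("confusion matrix:\n", cm)
print(f"hold-out AUC: {auc:.3f}")
```

RandomizedSearchCV takes a similar estimator-plus-parameter-space setup but samples a fixed number of candidates instead of trying every combination, which is why it is cheaper.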

4. Preprocessing and pipelines

- Creating dummy variables = converting categorical features into numerical features

- Handling missing data = imputing

- Pipeline

- Centering and scaling (Normalizing)

ex) kNN uses the distance between data points, so features on larger scales influence the model more
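
The preprocessing steps above can be chained in one Pipeline. A minimal sketch: the tiny DataFrame and its column names (income, age, region, bought) are hypothetical, made up only to have a missing value and a categorical column to work with.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features on very different scales,
# one categorical feature, and a binary target.
df = pd.DataFrame({
    "income": [30000, 52000, np.nan, 87000, 41000, 99000, 36000, 75000],
    "age": [25, 41, 33, np.nan, 29, 52, 23, 47],
    "region": ["north", "south", "south", "east", "north", "east", "north", "south"],
    "bought": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Dummy variables: the categorical column becomes 0/1 columns.
# drop_first avoids a redundant column (one level is implied by the others).
X = pd.get_dummies(
    df.drop(columns="bought"), columns=["region"], drop_first=True, dtype=float
)
y = df["bought"]

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # fill missing values with the column mean
    ("scaler", StandardScaler()),                 # center to mean 0, scale to unit variance
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(list(X.columns))
print(preds)
```

Putting the imputer and scaler inside the pipeline means they are fitted only on the data passed to fit(), so the same object can be handed to cross-validation or a grid search without leaking information from the validation folds.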

Hands-on practice

import the relevant modules from sklearn => EDA => create a model instance => fit => predict => evaluate the score
