
[WEEK1] Supervised Learning with scikit-learn

by gojw 2021. 8. 31.

The first week of the study group is over. We covered supervised learning in general and how to use sklearn.

 

1. Classification: solving classification problems with supervised learning techniques.

- Supervised Learning = learning from already labeled data

- Numerical / Visual EDA = pandas, seaborn

- k-Nearest Neighbors Algorithm = predicts a label by majority vote among the k closest training points, which implicitly sets a decision boundary

- Measuring model performance

- Train/test split = split the data into one part for training the model and another for testing it

- Overfitting and underfitting (bias and variance)

=> in kNN, a small k leads to overfitting and a large k leads to underfitting
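
The effect of k can be checked with a minimal sketch like the one below. It is not from the course material: the iris dataset, the 70/30 split, and the particular k values are assumptions chosen only for illustration. A large gap between train and test accuracy hints at overfitting, while low accuracy on both hints at underfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Train/test split: 70% for fitting, 30% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit one kNN model per k and record (train accuracy, test accuracy).
scores = {}
for k in (1, 5, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:>2}  train={train_acc:.3f}  test={test_acc:.3f}")
```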

 

2. Regression

- What is regression? The target variable is continuous.

- Linear regression y = a x + b

- Cross-validation = k-fold splitting inside the training set: train on k-1 folds, validate on the remaining fold, and rotate k times

- Regularized regression = penalizing large coefficients. Ridge, Lasso regression

Addresses the high-variance problem (overfitting)
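
As a rough sketch of the regression items above, the snippet below compares plain linear regression with Ridge and Lasso using 5-fold cross-validation. The diabetes dataset and `alpha=0.1` are assumptions picked for illustration, not values from the course.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Assumed example data: the diabetes regression dataset bundled with scikit-learn.
X, y = load_diabetes(return_X_y=True)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),   # L2 penalty on the coefficients
    "lasso": Lasso(alpha=0.1),   # L1 penalty, can shrink coefficients to exactly 0
}

# 5-fold cross-validation; the default score for regressors is R^2.
cv_r2 = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    cv_r2[name] = fold_scores.mean()
    print(f"{name:>6}: mean R^2 = {cv_r2[name]:.3f}")

# Count how many coefficients Lasso set to zero when fit on all the data.
lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))
print("Lasso coefficients set to zero:", n_zero)
```

Because Lasso can zero out coefficients, it is sometimes also used as a rough feature-selection tool.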

 

3. Fine-tuning your model

- How good is your model?

- Metrics for classification = confusion matrix

- Logistic regression and the ROC curve = true positive rate vs. false positive rate as the decision threshold varies

+ PR curve = precision vs. recall

- Area under the ROC curve = AUC, AUC computation

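if AUC > 0.5 => the model is better than random guessing

- Hyperparameter tuning = with GridSearchCV (tries every combination, computationally expensive) or RandomizedSearchCV (samples a fixed number of combinations)

Hyperparameters cannot be learned by fitting the model; they must be chosen before fitting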

- Hold-out set

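What is a hold-out set? Split the dataset into a train set and a test set, and keep the test set untouched until the final evaluation.

Cross validation = split the train set into k groups (folds).

CV is usually preferred because the model is trained and evaluated on multiple train/validation splits, which gives a more stable performance estimate.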

https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f
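
The pieces of this section can be combined in one small sketch: a hold-out split, a grid search with cross-validation on the training part only, and a final evaluation with a confusion matrix and ROC AUC. The breast cancer dataset, the candidate values for `C`, and `max_iter=5000` are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed example data: a binary classification dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Hold-out set: kept aside and used only for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# C is a hyperparameter: fit() does not learn it, so candidate values
# are compared with 5-fold cross-validation inside the training set.
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    LogisticRegression(max_iter=5000), param_grid, cv=5, scoring="roc_auc"
)
grid.fit(X_train, y_train)

# Final evaluation on the hold-out set with the refitted best model.
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points of the ROC curve
test_auc = roc_auc_score(y_test, y_prob)

print("best C:", grid.best_params_["C"])
print("confusion matrix:\n", cm)
print(f"hold-out AUC: {test_auc:.3f}")
```

Swapping `GridSearchCV` for `RandomizedSearchCV` (with an `n_iter` argument) would try only a sample of the candidates, which is cheaper on large grids.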

 

4. Preprocessing and pipelines

- Creating dummy variables = converting categorical features into numerical features

- Handling missing data = imputing

- Pipeline

- Centering and scaling (Normalizing)

ex) kNN = uses distances between data points, so features on larger scales dominate the model unless they are scaled
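
A minimal sketch of these preprocessing steps chained together is shown below. The toy DataFrame, its column names, and its values are entirely hypothetical and exist only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features on very different scales,
# one categorical feature, and some missing values.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, np.nan, 180.0, 175.0, 155.0],
    "income": [20000.0, 35000.0, 50000.0, np.nan, 80000.0, 25000.0],
    "city": ["seoul", "busan", "seoul", "daegu", "busan", "seoul"],
    "label": [0, 0, 1, 1, 1, 0],
})

# Dummy variables: turn the categorical column into 0/1 columns.
# drop_first=True removes one redundant category column.
X = pd.get_dummies(
    df.drop(columns="label"), columns=["city"], drop_first=True, dtype=float
)
y = df["label"]

# Pipeline: impute missing values, then center/scale, then fit kNN.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X, y)
preds = pipe.predict(X)

print(X.columns.tolist())
print(preds)
```

Putting the steps in a `Pipeline` means the imputer and scaler are fitted only on the data passed to `fit`, which helps avoid leaking test-set statistics when the pipeline is used with a train/test split or cross-validation.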

 

Hands-on practice

import the related modules from sklearn => EDA => create a model instance => fit => predict => evaluate the score
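
The workflow above might look roughly like this in code. The dataset (iris as a pandas DataFrame), the model (kNN with k=5), and the 80/20 split are assumptions for illustration rather than the exact course exercises.

```python
# 1. Import the related modules from sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 2. EDA: load the data as a DataFrame and look at basic statistics
iris = load_iris(as_frame=True)
df = iris.frame                      # feature columns plus a "target" column
summary = df.describe()              # numerical EDA
class_counts = df["target"].value_counts()
print(summary)
print(class_counts)

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 3. Create a model instance
model = KNeighborsClassifier(n_neighbors=5)

# 4. Fit
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate the score (accuracy for classifiers)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```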

 

 
