The first week of the study is over. We covered the overall picture of supervised learning and how to use sklearn.
1. Classification: solving classification problems with supervised learning techniques.
- Supervised Learning = using already labeled data
- Numerical / Visual EDA = pandas, seaborn
- k-Nearest Neighbors Algorithm = predicts by majority vote of the k nearest points, which sets a decision boundary
- Measuring model performance
- Train/test split = split the data into one part for training the model and another for testing it
- Overfitting and underfitting (bias and variance)
=> in kNN, a small k leads to overfitting and a large k leads to underfitting (see the sketch below)
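A minimal sketch of the workflow above, assuming scikit-learn's built-in breast cancer dataset as a stand-in for the course data; it prints train and test accuracy for a few values of k to show how a small k overfits and a large k underfits.

```python
# Sketch: kNN classification with a train/test split (breast cancer data is
# only a placeholder, not the dataset used in the study).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold back part of the data so the model is scored on points it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Small k -> jagged decision boundary (overfitting: train score >> test score);
# large k -> overly smooth boundary (underfitting: both scores drop).
for k in (1, 5, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```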
2. Regression
- What is regression? The target variable is continuous.
- Linear regression y = a x + b
- Cross-validation = done inside the training set; k-fold CV
- Regularized regression = penalizing large coefficients; Ridge and Lasso regression
Addresses the high-variance problem (overfitting), as in the sketch below
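A rough sketch of linear vs. regularized regression scored with 5-fold cross-validation; the diabetes dataset and the alpha values are placeholders I picked, not values from the course.

```python
# Sketch: compare plain, Ridge, and Lasso regression with k-fold CV.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# cross_val_score splits the data into 5 folds and returns one score per fold;
# Ridge/Lasso penalize large coefficients, which tames high variance.
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    scores = cross_val_score(model, X, y, cv=5)  # R^2 is the default for regressors
    print(type(model).__name__, scores.mean().round(3))
```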
3. Fine-tuning your model
- How good is your model?
- Metrics for classification = confusion matrix
- Logistic regression and the ROC curve = plots the true positive rate (TP) against the false positive rate (FP) across thresholds
+ PR curve = plots precision against recall
- Area under the ROC curve = AUC, AUC computation
if AUC > 0.5 => model is better than random guessing
- Hyperparameter tuning = with GridSearchCV (computationally expensive), RandomizedSearchCV
Hyperparameters cannot be learned by fitting the model (see the sketch after this list)
- Hold-out set
Meaning of a hold-out set? Split the dataset into a training set and a test set that is held back for final evaluation.
Cross-validation = split the training set into k groups (folds).
CV is usually preferred because the model is trained and evaluated on multiple train-validation splits.
https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f
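A sketch pulling the fine-tuning ideas together: keep a hold-out test set, run GridSearchCV with ROC AUC as the scoring metric and 5-fold CV inside the training set, then report AUC and the confusion matrix on the hold-out set. The C grid and the dataset are assumptions for illustration, not taken from the course.

```python
# Sketch: hyperparameter tuning + ROC AUC on a hold-out set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# C cannot be learned by .fit(), so search a grid of candidates,
# scoring each with 5-fold cross-validation inside the training set.
param_grid = {"C": np.logspace(-3, 3, 7)}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

# Final evaluation happens once, on the untouched hold-out set.
y_prob = grid.predict_proba(X_test)[:, 1]
print("best C:", grid.best_params_)
print("hold-out AUC:", roc_auc_score(y_test, y_prob))  # > 0.5 beats random guessing
print(confusion_matrix(y_test, grid.predict(X_test)))
```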
4. Preprocessing and pipelines
- Creating dummy variables = converting categorical features -> numerical features
- Handling missing data = imputing
- Pipeline
- Centering and scaling (Normalizing)
ex) kNN = uses distances between data points, so features on larger scales dominate the model (see the pipeline sketch below)
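A small pipeline sketch, assuming a made-up DataFrame with a categorical column and missing values (nothing here comes from the actual course data): create dummy variables, impute, scale, then fit kNN.

```python
# Sketch: dummy variables + imputation + scaling + kNN in one Pipeline.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "height": [170, 182, np.nan, 165, 158, 177],        # note the missing value
    "income": [3200, 5100, 4800, np.nan, 2900, 6000],   # much larger scale than height
    "region": ["A", "B", "A", "B", "A", "B"],            # categorical feature
    "label":  [0, 1, 1, 0, 0, 1],
})

# Dummy variables: categorical 'region' -> numerical column(s).
X = pd.get_dummies(df.drop(columns="label"), drop_first=True)
y = df["label"]

# The pipeline runs the steps in order; scaling keeps the large-scale 'income'
# feature from dominating kNN's distance computation.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```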
Hands-on practice
Import the relevant modules from sklearn => EDA => create a model instance => fit => predict => evaluate the score
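The same checklist mapped line by line to code, using the built-in iris dataset purely as a convenient example (EDA is only marked as a comment here).

```python
# Sketch of the generic sklearn workflow from the notes.
from sklearn.datasets import load_iris                        # import related modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                             # (EDA would happen here)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier()                                # make instance of model
model.fit(X_train, y_train)                                   # fit
y_pred = model.predict(X_test)                                # predict
print(model.score(X_test, y_test))                            # evaluate score
```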