Machine Learning Basics
Lesson 4: Bias, Variance and the Trade-off
Bias is the simplifying assumptions made by a model to make the target function easier to learn.
Generally, parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias.
Variance is the amount that the estimate of the target function will change if different training data were used.
The k-Nearest Neighbors algorithm is an example of a high-variance algorithm, whereas Linear Discriminant Analysis is an example of a low variance algorithm.
The goal of any predictive modeling machine learning algorithm is to achieve low bias and low variance.
Linear Algorithms
Lesson 5: Linear Regression
y = B0 + B1 * x
The model predicts y given the input x, and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.
Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
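As a minimal sketch of the linear algebra approach, the coefficients B0 and B1 can be computed directly with the closed-form ordinary least squares formulas; the x and y values below are made up for illustration.

```python
# A minimal sketch: fitting y = B0 + B1 * x with ordinary least squares
# on made-up data (the x and y values are illustrative only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

# Closed-form OLS estimates for a single input variable.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)            # learned coefficients
print(b0 + b1 * 6.0)     # predict y for a new x
```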
Lesson 6: Logistic Regression
It is the go-to method for binary classification problems (problems with two class values).
Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.
The logistic function transforms any value into the range 0 to 1.
Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other.
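A minimal sketch of the logistic function and a logistic regression fit, using scikit-learn's LogisticRegression; the toy inputs and class labels below are illustrative only.

```python
# A minimal sketch of the logistic function and a logistic regression fit
# with scikit-learn; the toy data below is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    # Squashes any real value into the range 0 to 1.
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # two class values

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))  # probability of each class
```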
Lesson 7: Linear Discriminant Analysis
If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.
The representation of LDA is straightforward. It consists of the statistical properties of your data, calculated for each class. For a single input variable this includes:
1. The mean value for each class.
2. The variance calculated across all classes.
Predictions are made by calculating a discriminant value for each class and predicting the class with the largest value.
LDA assumes the data has a Gaussian distribution.
(Removing outliers during the preprocessing step helps.)
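A rough sketch of these ideas for a single input variable: per-class means, a variance pooled across the classes, and a discriminant value for each class. The toy numbers and the particular discriminant formula used below are assumptions for illustration, not a reference implementation.

```python
# A rough sketch of LDA for a single input variable: per-class means,
# a variance shared across classes, and a discriminant value per class
# (the toy numbers are illustrative only).
import numpy as np

data = {0: np.array([1.0, 1.5, 2.0, 2.5]),
        1: np.array([4.0, 4.5, 5.0, 5.5])}

n_total = sum(len(v) for v in data.values())
means = {c: v.mean() for c, v in data.items()}
# Pooled (shared) variance across all classes.
variance = sum(((v - means[c]) ** 2).sum() for c, v in data.items()) / (n_total - len(data))
priors = {c: len(v) / n_total for c, v in data.items()}

def discriminant(x, c):
    # One common form of the per-class discriminant value.
    return x * means[c] / variance - means[c] ** 2 / (2 * variance) + np.log(priors[c])

x_new = 3.8
print(max(data, key=lambda c: discriminant(x_new, c)))  # predicted class
```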
Nonlinear Algorithms
Lesson 9: Naive Bayes
Naive Bayes assumes a Gaussian distribution for each feature and that the features are independent of each other.
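A minimal sketch using scikit-learn's GaussianNB, which matches both assumptions (Gaussian inputs, independent features); the toy data is illustrative only.

```python
# A minimal sketch: Gaussian Naive Bayes with scikit-learn
# (the toy data is illustrative only).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.9, 4.2], [4.1, 3.8]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)
print(model.predict([[1.1, 2.0]]))        # predicted class
print(model.predict_proba([[1.1, 2.0]]))  # class probabilities
```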
Lesson 10: k-Nearest Neighbors
Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression problems, this might be the mean output variable; for classification problems, it might be the mode (or most common) class value.
The trick is in how to determine the similarity between the data instances. The simplest technique if your attributes are all of the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on the differences between each input variable.
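A minimal sketch of this, assuming all attributes share the same scale: Euclidean distance from the new point to every training instance, then the mode class among the k nearest neighbors (the toy data and the value of k are illustrative only).

```python
# A minimal sketch of KNN classification with plain Euclidean distance,
# assuming all attributes are on the same scale (toy data is illustrative only).
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x_new, k=3):
    # Euclidean distance from the new point to every training instance.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = y_train[np.argsort(distances)[:k]]
    # Mode (most common class) among the k nearest neighbors.
    return Counter(nearest).most_common(1)[0][0]

print(knn_predict(np.array([1.2, 1.5])))
```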
KNN can require a lot of memory or space to store all of the data, but it only performs a calculation (or learns) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.
The idea of distance or closeness can break down in very high dimensions (lots of input variables) which can negatively affect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.
Lesson 11: Learning Vector Quantization
A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset.
The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm.
Once learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction.
Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset, try using LVQ to reduce the memory requirements of storing the entire training dataset.
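As a rough sketch of the idea described above, assuming the common LVQ1 update rule: codebook vectors start from training instances and are pulled toward instances of the same class and pushed away from instances of a different class. The data, codebook initialization, learning rate, and epoch count are all illustrative assumptions.

```python
# A rough sketch of an LVQ1-style update: codebook vectors are adapted
# toward same-class instances and away from different-class instances.
# Data, codebook setup, learning rate, and epochs are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y = np.array([0, 0, 1, 1])

# Initialize one codebook vector per class from a random instance of that class.
codebooks, cb_classes = [], []
for c in np.unique(y):
    i = rng.choice(np.flatnonzero(y == c))
    codebooks.append(X[i].copy())
    cb_classes.append(c)
codebooks, cb_classes = np.array(codebooks), np.array(cb_classes)

lrate, epochs = 0.3, 20
for epoch in range(epochs):
    rate = lrate * (1.0 - epoch / epochs)  # decay the learning rate
    for xi, yi in zip(X, y):
        bmu = np.argmin(((codebooks - xi) ** 2).sum(axis=1))  # best matching unit
        direction = 1.0 if cb_classes[bmu] == yi else -1.0
        codebooks[bmu] += direction * rate * (xi - codebooks[bmu])

# Predict like KNN with k=1 over the codebook vectors.
x_new = np.array([1.2, 1.4])
print(cb_classes[np.argmin(((codebooks - x_new) ** 2).sum(axis=1))])
```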
Lesson 12: Support Vector Machines
A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.
In two dimensions, you can visualize this as a line, and let's assume that all of our input points can be completely separated by this line.
The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.
The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that has the largest margin.
Only these points are relevant in defining the hyperplane and in the construction of the classifier.
These points are called the support vectors. They support or define the hyperplane.
In practice, an optimization algorithm is used to find the values for the coefficients that maximize the margin.
SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.
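A minimal sketch with scikit-learn's SVC and a linear kernel on a toy, linearly separable dataset; the data and the C value are illustrative only.

```python
# A minimal sketch: a linear SVM with scikit-learn on a toy, linearly
# separable dataset (the data and C value are illustrative only).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(X, y)
print(model.support_vectors_)       # the points that define the hyperplane
print(model.predict([[3.0, 3.0]]))  # class for a new point
```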
Ensemble Algorithms
Lesson 13: Bagging and Random Forest
Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation, or bagging.
The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots of samples of your data, calculate the mean of each, then average all of your mean values to give you a better estimate of the true mean value.
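A minimal sketch of that bootstrap estimate of a mean; the sample and the number of bootstrap repetitions below are illustrative assumptions.

```python
# A minimal sketch of the bootstrap estimate of a mean
# (sample and number of bootstrap samples are illustrative only).
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=5.0, size=100)  # toy sample

# Resample with replacement many times and average the resulting means.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means))  # averaged bootstrap means approximate the true mean
```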
In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees.
Multiple samples of your training data are taken then models are constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.
If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm.
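A minimal sketch comparing bagged decision trees and a random forest with scikit-learn; the synthetic dataset and hyperparameters are illustrative assumptions.

```python
# A minimal sketch: bagged decision trees vs. a random forest with scikit-learn
# (synthetic dataset and hyperparameters are illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1)

print(cross_val_score(bagging, X, y, cv=5).mean())  # bagged trees accuracy
print(cross_val_score(forest, X, y, cv=5).mean())   # random forest accuracy
```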
Lesson 14: Boosting and AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.
This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight.
Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence.
After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on the training data.
Because so much attention is put on correcting mistakes made by the algorithm, it is important that you have clean data with outliers removed.
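A minimal sketch of AdaBoost over short decision trees (depth-1 stumps) with scikit-learn; the synthetic dataset and hyperparameters are illustrative assumptions.

```python
# A minimal sketch: AdaBoost over short decision trees (decision stumps)
# with scikit-learn (synthetic dataset and hyperparameters are illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# max_depth=1 gives the short trees (stumps) typically used with AdaBoost.
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=1)
print(cross_val_score(model, X, y, cv=5).mean())  # cross-validated accuracy
```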