The maximum likelihood method processes the entire training set at once, which can be computationally expensive for large datasets. When the data is large, it is therefore advisable to use a sequential (online) algorithm that considers data points one at a time and updates the model parameters after each one.
When applying a sequential algorithm, after data point $n$ is presented, the parameter vector $w$ is updated as follows.
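$$w^{(\tau + 1)} = w^{(\tau)} - \eta \nabla E_n$$

where the total error function is $E = \sum\nolimits_n {E_n}$.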
Here $\tau$ denotes the iteration number, and $\eta$ is a parameter that determines the learning rate; this update rule is known as stochastic gradient descent. When applied to the squared error function discussed in the previous post (55_Likelihood Function), it takes the following form.
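$$w^{(\tau + 1)} = w^{(\tau)} + \eta \left( t_n - {w^{(\tau)\mathrm{T}}}\phi_n \right)\phi_n$$

where $\phi_n = \phi(x_n)$.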
The above formula is known as the least-mean-squares algorithm.
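As a concrete illustration, here is a minimal NumPy sketch of the LMS update applied one data point at a time (the function name lms_update and the toy data are illustrative, not from the book):

```python
import numpy as np

def lms_update(w, phi_n, t_n, eta):
    """One LMS update of the weight vector for a single data point.

    w     : current weights, shape (M,)
    phi_n : basis-function vector phi(x_n), shape (M,)
    t_n   : scalar target for data point n
    eta   : learning rate
    """
    error = t_n - w @ phi_n          # prediction error for point n
    return w + eta * error * phi_n   # step along the negative gradient of E_n

# Toy usage: learn t = 2x + 1 one point at a time with phi(x) = [1, x]
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    t = 2.0 * x + 1.0 + 0.05 * rng.normal()
    w = lms_update(w, np.array([1.0, x]), t, eta=0.1)
print(w)  # converges toward [1, 2]
```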
Overfitting refers to a situation where, after training is complete, the model predicts well on the training data but fails to generalize to new data. It has various causes, such as insufficient data or an overly complex model, and there are many methods to mitigate or prevent it.
Adding a regularization term to the error function is one method of controlling overfitting. The error function to be minimized then becomes the following.
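$${E_D}(w) + \lambda {E_W}(w)$$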
Where $\lambda$ is the regularization coefficient that adjusts the balance between the data-dependent error ${E_D}(w)$ and the regularization term ${E_W}(w)$. The individual terms and the overall error function are as follows.
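$${E_D}(w) = \frac{1}{2}\sum\limits_{n = 1}^N {\left\{ t_n - w^{\mathrm{T}}\phi(x_n) \right\}^2}, \qquad {E_W}(w) = \frac{1}{2}w^{\mathrm{T}}w$$

$$E(w) = \frac{1}{2}\sum\limits_{n = 1}^N {\left\{ t_n - w^{\mathrm{T}}\phi(x_n) \right\}^2} + \frac{\lambda}{2}w^{\mathrm{T}}w$$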
Because this error function remains a quadratic function of $w$, setting its gradient with respect to $w$ to 0 and solving yields the following closed-form solution.
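$$w = \left( \lambda I + \Phi^{\mathrm{T}}\Phi \right)^{-1}\Phi^{\mathrm{T}}t$$

where $t = (t_1, \ldots, t_N)^{\mathrm{T}}$ and $\Phi$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(x_n)$.

A minimal NumPy sketch of this closed-form solution, assuming $\Phi$ has already been computed (the function name regularized_ls is illustrative, not from the book):

```python
import numpy as np

def regularized_ls(Phi, t, lam):
    """Closed-form minimizer of the regularized sum-of-squares error.

    Phi : (N, M) design matrix with entries phi_j(x_n)
    t   : (N,) vector of targets
    lam : regularization coefficient lambda
    """
    M = Phi.shape[1]
    # Solve (lambda I + Phi^T Phi) w = Phi^T t; solving the linear system
    # is cheaper and more numerically stable than forming the inverse.
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```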
When considering multiple target variables instead of a single one, the problem can be formulated by using the same set of basis functions to model every component of $t$:
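$$y(x, W) = W^{\mathrm{T}}\phi(x)$$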
Where $y$ is a $K$-dimensional column vector, $W$ is a parameter matrix of size ${M \times K}$, and $\phi(x)$ is an $M$-dimensional column vector with elements ${\phi _j}(x)$, where ${\phi _0}(x) = 1$. This represents a single-layer neural network, and the conditional distribution can be assumed to be an isotropic Gaussian:
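$$p(t \mid x, W) = \mathcal{N}\left( t \mid W^{\mathrm{T}}\phi(x),\ \sigma^2 I \right)$$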
The log likelihood function is as follows.
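$$\ln p(T \mid X, W, \sigma^2) = \sum\limits_{n = 1}^N \ln \mathcal{N}\left( t_n \mid W^{\mathrm{T}}\phi(x_n),\ \sigma^2 I \right) = -\frac{NK}{2}\ln\left( 2\pi\sigma^2 \right) - \frac{1}{2\sigma^2}\sum\limits_{n = 1}^N {\left\| t_n - W^{\mathrm{T}}\phi(x_n) \right\|^2}$$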
Similarly, maximizing with respect to $W$ yields the following.
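$$W_{\mathrm{ML}} = \left( \Phi^{\mathrm{T}}\Phi \right)^{-1}\Phi^{\mathrm{T}}T$$

Examining this result for each target variable separately gives

$$w_k = \left( \Phi^{\mathrm{T}}\Phi \right)^{-1}\Phi^{\mathrm{T}}t_k = \Phi^{\dagger}t_k$$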
Where ${t_k}$ is an $N$-dimensional column vector. The solution to the regression problem therefore decouples across the different target variables, and only the pseudo-inverse $\Phi^{\dagger}$ of the matrix $\Phi$, which is shared by all the vectors ${w_k}$, needs to be computed.
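A minimal NumPy sketch of this decoupling (the function name multi_output_ml is illustrative):

```python
import numpy as np

def multi_output_ml(Phi, T):
    """Maximum likelihood weights for K target variables at once.

    Phi : (N, M) design matrix shared by all target variables
    T   : (N, K) target matrix whose k-th column is t_k
    """
    # pinv(Phi) is computed once and applied to every column of T,
    # so the K regression problems are solved independently in one step.
    return np.linalg.pinv(Phi) @ T   # W has shape (M, K)
```

Solving column by column with np.linalg.pinv(Phi) @ T[:, k] would give the same $w_k$, which is exactly the decoupling noted above.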
ref: Chris Bishop, "Deep Learning: Foundations and Concepts"