Notes

What is a cost function?

A cost function is a mathematical function that measures the performance or discrepancy of a machine learning model on a given task. It quantifies the difference between the predicted outputs of the model and the true or desired outputs.

What is the difference between a cost function and a loss function?

In simple words, a loss function measures the error of the model on a single training example, whereas the cost function aggregates that error over the whole dataset (typically as an average or sum of the per-example losses).
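
In symbols (a small sketch, where \ell denotes the per-example loss and n the number of training examples):

\[ J = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) \]

Here \ell(y_i, \hat{y}_i) is the loss on a single example, and the cost J averages it over the whole dataset.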

What is RSS? And why do we need to minimize a cost function?

RSS stands for Residual Sum of Squares. It is a cost function that quantifies the discrepancy between a model's predicted values and the actual observed values. The need to optimize/minimize a cost function arises from the desire to create models that can accurately make predictions or perform a specific task. By minimizing the cost function we aim to find the set of parameter values for which the model's predictions are as close as possible to the true values.
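
In symbols, for n observations:

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

where y_i is the actual observed value and \hat{y}_i is the model's prediction for the i-th example.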

What are some ways to minimize cost function?

There are two types of minimization: constrained and unconstrained.

We hardly use constrained minimization nowadays. Unconstrained minimization can be solved using two methods:

Closed form method:

The function to be minimized is simply differentiated and set equal to 0 to obtain a candidate solution. The second derivative is then checked at that point: if it is greater than 0, the solution is indeed a minimum.
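
As a quick illustrative example (not from the original notes), consider minimizing f(x) = (x - 3)^2:

\[ f^{'}(x) = 2(x - 3) = 0 \;\Rightarrow\; x = 3, \qquad f^{''}(x) = 2 > 0 \]

Since the second derivative is positive, x = 3 is a minimum.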

Gradient Descent:

It is an iterative minimization method which reaches the minimum step by step. You start with an initial assumed value of the parameter, which can be anything (say X_{0}), and choose a learning rate \alpha. For the current value x, you compute the derivative of the function, denoted f^{'}(x). The parameter is then updated to x - \alpha \, f^{'}(x). You continue the process until the algorithm reaches an optimum point, i.e. the value of the parameter no longer changes appreciably between iterations.
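
A minimal sketch of this update rule in Python, reusing the toy function f(x) = (x - 3)^2 whose derivative is 2(x - 3); the learning rate, stopping tolerance, and starting point are illustrative choices, not part of the original notes:

```python
def gradient_descent(df, x0, alpha=0.1, tol=1e-6, max_iters=1000):
    """Iteratively move x in the direction of the negative gradient."""
    x = x0
    for _ in range(max_iters):
        x_new = x - alpha * df(x)     # x_new = x_old - alpha * f'(x_old)
        if abs(x_new - x) < tol:      # parameter barely changes -> stop
            break
        x = x_new
    return x

# Example: minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approaches 3.0
```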

Understanding Common Concepts in Linear Regression

1 Why do we square the residuals in machine learning?

First of all, let's understand what residuals are. A residual is the difference between the predicted value and the actual value (y_pred - y_actual). Now comes the question: why do we have to square them, and does it really make a difference? The answer lies in how gradient descent and the optimization process work. Let's go through the reasons one by one.

The first reason is that squaring makes every error positive, so positive and negative residuals do not cancel each other out (a short numeric sketch follows this list).

Squaring the residuals amplifies large errors, so larger errors are penalized more heavily than smaller ones, which helps the optimization process.

Squared residuals give a differentiable loss function, which is quite important for many optimization algorithms used in machine learning and helps the model train efficiently.

Squared residuals also provide a measure of the variance of the target variable around the model's predictions. This is useful for judging how well the model fits the data, which is quite important for model selection.
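
A tiny numeric illustration of the first two points (the numbers are made up purely for demonstration):

```python
# Residuals of +5 and -5 cancel out if you just sum them,
# but squaring makes every error count and penalizes large errors more.
residuals = [5, -5, 1, -1]

plain_sum   = sum(residuals)                   # 0  -> errors "nullify" each other
squared_sum = sum(r ** 2 for r in residuals)   # 52 -> large errors dominate (25 + 25 + 1 + 1)
print(plain_sum, squared_sum)
```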

2 Can we use RSS to compare models?

RSS (Residual Sum of Squares) is a common measure of the difference between the predicted and actual values in a linear regression: it is the sum of the squared differences between the actual and predicted values.

There is a general notion that the lower the RSS, the better the model, but this is not completely true. One has to understand that comparing models solely on their RSS can be misleading, because RSS depends on the scale of the outcome variable.

Let's understand this with an example. Suppose you are using two models: Model A predicts age and Model B predicts salary. Since age and salary are on completely different scales, we cannot directly compare their RSS values. In addition, RSS alone does not tell us anything about the complexity of the model or how well it performs on new data.
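
An illustration of this scale problem (the numbers below are invented purely for demonstration):

```python
# Model A predicts age (years), Model B predicts salary (dollars).
# Even if both are "equally good" in relative terms, salary errors are
# numerically huge, so Model B's RSS dwarfs Model A's.
age_actual,    age_pred    = [25, 40, 33], [24, 42, 30]
salary_actual, salary_pred = [50_000, 80_000, 65_000], [48_000, 84_000, 59_000]

rss = lambda y, y_hat: sum((a - p) ** 2 for a, p in zip(y, y_hat))
print(rss(age_actual, age_pred))        # small number
print(rss(salary_actual, salary_pred))  # enormous number, yet not a "worse" model
```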

3 What is a Q-Q plot? How to interpret the Q-Q plot?

A Q-Q plot (quantile-quantile plot) is a graphical tool for checking whether a given dataset is approximately normally distributed. It compares the distribution of the data to a normal distribution by plotting the quantiles of the data against the corresponding theoretical quantiles of a normal distribution.

If the data is normally distributed, the Q-Q plot shows the points clustered around a straight line; otherwise the data points form a non-linear pattern.

Things to note in a Q-Q plot are as follows (a short plotting sketch follows this list):

The points in the Q-Q plot should fall approximately along a straight line.

If the data are normally distributed, the points in the Q-Q plot will be evenly distributed around that line.

One can also look for outliers in a Q-Q plot; they are usually points that lie far away from the main cluster of points.
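
A minimal sketch of drawing a Q-Q plot against a normal distribution, assuming NumPy, SciPy and Matplotlib are available (the generated data here is just an example):

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Example data: if it really is normal, the points should hug the straight line.
data = np.random.normal(loc=0.0, scale=1.0, size=200)

stats.probplot(data, dist="norm", plot=plt)  # data quantiles vs. theoretical normal quantiles
plt.title("Q-Q plot against a normal distribution")
plt.show()
```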

ReLU (Rectified Linear Unit) is one of the most widely used activation functions. It introduces non-linearity into the network so it can learn complex patterns.

What ReLU Is

The ReLU function, defined as:

\[ ReLU(x) = \max(0, x) \]
  • Outputs:
      • 0 when x is less than 0.
      • x when x is greater than or equal to 0.
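
A one-line NumPy version of this definition (an illustrative sketch, not from the original notes):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```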

Why ReLU Is Used in DNNs

  1. Avoids Vanishing Gradients: Compared to sigmoid/tanh, ReLU does not saturate for large positive values. This allows gradients to stay strong → deeper networks can be trained.

  2. Computational Efficiency: Just a max() operation → fast and cheap.

  3. Produces Sparse Activations: Many outputs become zero → reduces computation and can help generalization.

MLP (Multilayer Perceptron)

An MLP is composed of one input layer, one or more layers of TLUs (threshold logic units) called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers.
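
A minimal sketch of this layered structure using scikit-learn's MLPClassifier; the dataset, layer sizes, and hyperparameters below are arbitrary illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Toy dataset and a small MLP: input layer -> two hidden layers -> output layer.
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(16, 8),  # two hidden layers
                    activation="relu",
                    max_iter=1000,
                    random_state=42)
mlp.fit(X, y)
print(mlp.score(X, y))  # training accuracy
```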

Learning rate (α)


It is the step size we take to move in the direction of the local minimum. Notice that as we approach the minimum, the effective step size keeps decreasing. Some might think this is done manually, but it is the effect of the slope: as we go downhill the slope keeps decreasing, and since the update to x is the product of the learning rate and the slope, the update also keeps decreasing.

What should the step size (learning rate \alpha) be? We generally use a learning rate between 0.0 and 1.0. Remember that the learning rate should not be too high or too low: too small a value makes convergence very slow, while too large a value can overshoot the minimum and diverge.
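
A small sketch showing why the learning rate matters, reusing the toy function f(x) = (x - 3)^2 from earlier (the specific \alpha values are just examples):

```python
def run(alpha, x=0.0, steps=20):
    """Take a fixed number of gradient steps on f(x) = (x - 3)^2."""
    for _ in range(steps):
        x = x - alpha * 2 * (x - 3)   # x_new = x_old - alpha * f'(x_old)
    return x

print(run(alpha=0.1))   # converges smoothly towards 3
print(run(alpha=0.01))  # moves towards 3, but very slowly
print(run(alpha=1.1))   # overshoots and diverges away from 3
```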
