What is Deep Learning?

Deep learning is a subset of machine learning that leverages neural networks with multiple layers (hence "deep") to analyze and understand complex data. Inspired by how the human brain processes information, it enables computers to learn from large datasets without explicit programming for every task.

Why Does Deep Learning Matter?

Solving Complex Problems: Deep learning excels in tasks like image recognition, speech processing, and natural language understanding, areas where traditional hand-engineered algorithms struggle.
Automation: By learning features directly from the data, deep learning reduces the need for manual feature engineering, speeding up workflows.
Scalability: Its performance tends to keep improving as datasets grow, making it well suited to big data applications across industries.
Real-World Applications:
- Healthcare: Assisting in diagnosis and medical imaging.
- Finance: Enhancing fraud detection systems.
- Autonomous Vehicles: Enabling navigation and object detection.

Recent Advancements in Deep Learning

Deep learning continues to innovate, transforming diverse fields:

Color Restoration: Neural networks restore colors to grayscale images with remarkable accuracy.
Speech Enactment: Video of a speaker is synthesized so that the lip movements match a given audio clip, as demonstrated with synthesized videos of public figures like Barack Obama.
Handwriting Generation: Algorithms create realistic cursive handwriting in various styles.
Other Innovations: Machine translation, adding sound to silent movies, object classification, and self-driving cars.

Neurons and Neural Networks

Deep learning revolves around neurons and their interconnected networks. Let’s break this down, starting from biological neurons to their artificial counterparts.

Biological Neurons

Neurons are the brain's basic units, consisting of:
- Soma: Main body of the neuron.
- Dendrites: Branch-like structures that receive signals.
- Axon: A long arm that sends messages to other neurons.
Learning occurs through strengthened connections between neurons based on repeated activation.

Artificial Neurons

Artificial neurons mimic biological ones:

They receive input data (like dendrites), compute a weighted sum (like the soma), and pass the result to other neurons (like the axon).
Learning happens by adjusting weights and biases, improving the network’s accuracy over time.

Mathematical Formulation of Neural Networks

Neural networks are structured to resemble biological neurons, with each artificial neuron functioning as a computational unit. These neurons are organized into layers:

Input Layer: Accepts raw data as input.
Hidden Layers: Perform computations and extract features.
Output Layer: Produces the final result or prediction.

Each neuron in the network processes data mathematically:

Inputs ($x_1, x_2, \dots, x_n$): Data fed into the neuron.
Weights ($w_1, w_2, \dots, w_n$): Adjust the importance of each input.
Bias ($b$): A constant offset added to the weighted sum, which adds flexibility to the model.

The neuron computes a weighted sum of the inputs:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

To make this output useful for complex tasks, it is passed through an activation function (e.g., sigmoid, ReLU), giving the neuron’s final output.
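As a minimal sketch of a single artificial neuron, the snippet below computes the weighted sum and passes it through a sigmoid. The function names and the numeric values are illustrative assumptions, not taken from any particular library.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Illustrative values only (assumed for this example)
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

output = neuron_output(x, w, b)  # weighted sum is -0.4, so the output is a little below 0.5
```

Swapping `sigmoid` for another activation changes only the last step; the weighted sum stays the same.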

Activation functions are a critical component of neural networks: they introduce non-linearity into the model, allowing networks to learn complex patterns and relationships in the data. The choice of activation function is itself a hyperparameter of the model.

Without non-linearity, any stack of layers collapses into a single linear transformation, so the network would be no more expressive than a simple linear regression model.
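To see why, consider two layers with no activation function between them. The symbols $W_1, b_1, W_2, b_2$ are notation assumed here for the weights and biases of the two layers:

```latex
\begin{aligned}
h &= W_1 x + b_1 \\
y &= W_2 h + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)
\end{aligned}
```

The result has the same form as a single layer with weights $W_2 W_1$ and bias $W_2 b_1 + b_2$, so stacking more linear layers adds no expressive power.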

Weights:

Each connection between neurons has an associated weight that adjusts the strength of the input signal.
Weights with a larger magnitude increase the influence of the corresponding input on the neuron's output, while weights close to zero reduce it.

Bias:

The bias is an additional parameter added to the weighted sum, allowing the model to shift the activation function.
It helps the model fit the data better by providing flexibility in the output, even when all inputs are zero.

Together, weights and biases enable the neural network to learn complex patterns in the data by adjusting these parameters during training.

Forward Propagation

Forward propagation is the process of passing input data through a neural network to generate an output. Here's how it works:

Input Layer: Accepts input features.
Weights and Biases: Each input is multiplied by a weight, and a bias is added.
Linear Combination: Computes $z = \sum_{i} w_i x_i + b$.
Activation Function: Applies a non-linear function (e.g., ReLU, Sigmoid) to z to produce the neuron’s output.
Hidden Layers: The output of one layer serves as the input for the next.
Output Layer: Produces the final prediction.

Forward propagation calculates the output layer by layer. During training, the intermediate values it computes are reused by backpropagation.
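The steps above can be sketched for a tiny network with two inputs, two hidden neurons (ReLU), and one output neuron (sigmoid). The layer sizes, parameter values, and helper names are assumptions chosen for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the weights of one neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

# Illustrative parameters (assumed, not trained)
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

def forward(inputs):
    # The output of the hidden layer is the input of the output layer
    hidden = dense(inputs, hidden_weights, hidden_biases, relu)
    return dense(hidden, output_weights, output_biases, sigmoid)

prediction = forward([2.0, 1.0])  # a list with one value between 0 and 1
```

Each call to `dense` is one pass through the "weights and biases, linear combination, activation" sequence; chaining the calls is what makes the propagation "forward".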

Backpropagation

Backpropagation computes the gradient of the cost with respect to every weight and bias in the network, which determines how much each parameter should be adjusted. The error is propagated backward through the network, layer by layer, so that the parameters can be updated to reduce the gap between the predicted outputs and the target outputs.

Forward Pass: The input is passed through the network to make a prediction (output).
Error Calculation: The difference (error) between the predicted output and the target output is calculated using a loss function.
Backward Pass: The error is then propagated back through the network, from the output layer to the input layer, using the chain rule to calculate the gradient of the error with respect to each weight and bias.
Weight Update: The weights and biases are adjusted using the calculated gradients to minimize the error, typically through the gradient descent algorithm.

This process repeats iteratively, helping the model learn by gradually improving the weights to reduce the prediction error.
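A minimal sketch of this loop for a single sigmoid neuron with one input and a squared-error loss is shown below. The training example, learning rate, and function names are assumptions made for illustration; a real network applies the same chain rule across many layers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, learning_rate):
    # Forward pass
    z = w * x + b
    prediction = sigmoid(z)
    # Error calculation (squared error for one example)
    loss = (prediction - target) ** 2
    # Backward pass: chain rule, dL/dw = dL/dprediction * dprediction/dz * dz/dw
    dloss_dpred = 2.0 * (prediction - target)
    dpred_dz = prediction * (1.0 - prediction)
    grad_w = dloss_dpred * dpred_dz * x
    grad_b = dloss_dpred * dpred_dz
    # Weight update (gradient descent)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    return w, b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.5, target=1.0, learning_rate=0.5)
    losses.append(loss)
# losses shrinks over the iterations as the prediction approaches the target
```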

Gradient Descent

Gradient Descent is like a treasure hunt to find the lowest point in a hilly terrain. Imagine standing on a hill and wanting to reach the valley below. To do this, you look for the steepest downward slope and take a step in that direction. You repeat this process, checking the slope and stepping down each time, until you reach the valley—the point with the lowest cost.

In machine learning, gradient descent is an algorithm that finds the best value for a parameter (like "w") when fitting a line to data points. The goal is to minimize the gap between the actual data points and the line. Just as you adjust your steps on a hill based on its steepness, gradient descent adjusts "w" using the slope of the cost function.

Process:

Start with an initial guess for "w."
Compute the slope (gradient) of the cost function.
Take a step in the direction opposite to the gradient to reduce the cost, with the step size controlled by a parameter called the learning rate.
Repeat until the minimum cost is achieved or nearly reached.
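The process above can be sketched for fitting a line y = w * x to toy data. The data (generated from y = 3x), the learning rate, and the stopping tolerance are assumptions chosen for illustration.

```python
# Toy data generated from y = 3x (assumed for this example)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def cost(w):
    """Mean squared error of the line y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """Slope of the cost with respect to w."""
    return sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)

w = 0.0                # initial guess
learning_rate = 0.01
for _ in range(1000):
    step = learning_rate * gradient(w)  # slope scaled by the learning rate
    w -= step                           # move against the slope
    if abs(step) < 1e-9:                # stop once the steps become negligible
        break
# w ends up very close to 3, the slope used to generate the data
```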

Cost Function

The cost function in gradient descent quantifies the error between a model's predictions and the actual data. A common cost function is Mean Squared Error (MSE), which is calculated as:

$$J(w) = \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Where:

$n$ is the number of data points,
$y_i$ is the actual value for the i-th data point,
$\hat{y}_i$ is the predicted value for the i-th data point.

The goal of gradient descent is to minimize this cost function, adjusting model parameters (like weights) to reduce the error between predictions and actual values. For a linear model, MSE has a parabolic (bowl-shaped) form with a single global minimum, which makes it easy to optimize. For deep neural networks the cost surface is generally not convex and can have many local minima.
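The parabolic shape can be written out explicitly for a one-parameter line. Assuming the model is $\hat{y}_i = w x_i$ (a simplification with no intercept, chosen here for illustration), the cost and its slope are:

```latex
\begin{aligned}
J(w) &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w x_i\right)^2 \\
\frac{dJ}{dw} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - w x_i\right)
\end{aligned}
```

$J(w)$ is a quadratic in $w$, so it has a single minimum where the slope is zero, at $w = \sum_i x_i y_i / \sum_i x_i^2$ (provided at least one $x_i$ is non-zero).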

The graph below shows how the line changes with the value of w: as J(w) decreases, the line aligns more closely with the data points.

Learning Rate

The learning rate is a key parameter in the gradient descent algorithm that dictates the size of the steps taken towards minimizing the cost function. It controls how much the model's parameters (such as weights) are adjusted based on the calculated gradient.

Step Size:

A larger learning rate leads to bigger steps, which can speed up convergence but risks overshooting the minimum and causing the algorithm to oscillate or diverge.
A smaller learning rate results in smaller steps, allowing more precise convergence but making the process slower.

Typical Values:
Learning rates are typically small positive values between 0 and 1, with common starting points such as 0.1, 0.01, or 0.001.
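The effect of the step size can be seen on a toy cost function. The sketch below minimizes f(w) = w² (an assumed example whose minimum is at w = 0) with three different learning rates.

```python
def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2, whose gradient is 2 * w, and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

small = run_gradient_descent(0.01)   # slow: still far from the minimum after 20 steps
medium = run_gradient_descent(0.1)   # gets close to the minimum
large = run_gradient_descent(1.1)    # overshoots further on every step and diverges
```

With the large rate, each update flips the sign of w and increases its magnitude, which is the divergence described above.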

Difference between Loss Function and Cost Function

The loss function and cost function are both essential concepts in machine learning and neural networks, but they serve slightly different purposes.

Loss Function

The loss function measures the error for a single training example. It quantifies how well the model's prediction matches the actual target value.
It provides feedback to the model during training, guiding the adjustments of weights.
Example: Common loss functions include:
• Regression – MSE (Mean Squared Error, L2 loss), MAE (Mean Absolute Error, L1 loss), Huber loss
• Classification – Binary cross-entropy, Categorical cross-entropy
• Autoencoder (variational) – KL divergence
• GAN – Discriminator loss, Minimax GAN loss
• Embeddings – Triplet loss

Cost Function

The cost function is the average of the loss function over the entire training dataset. It represents the overall error of the model.
It is used to evaluate the model's performance across all training examples and is minimized during the training process.
Example: If the loss for a single example is the squared error, the cost function is the mean of those squared errors over all training examples, which is the MSE.
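The distinction can be sketched in a few lines. The function names and the sample values below are assumptions for illustration.

```python
def squared_error(actual, predicted):
    """Loss: the error for a single training example."""
    return (actual - predicted) ** 2

def mse_cost(actuals, predictions):
    """Cost: the average of the per-example losses over the dataset."""
    losses = [squared_error(y, y_hat) for y, y_hat in zip(actuals, predictions)]
    return sum(losses) / len(losses)

actuals = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

per_example = [squared_error(y, p) for y, p in zip(actuals, predictions)]  # [1.0, 0.0, 4.0]
total_cost = mse_cost(actuals, predictions)  # (1 + 0 + 4) / 3
```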

Activation Functions in Neural Networks

Activation functions are crucial components in neural networks that introduce non-linearity, enabling them to learn and model complex patterns.

Purpose of Activation Functions

Decision Making
- Determines whether a neuron should be activated.
- Helps the network decide which features are important for prediction.
Non-Linearity
- Introduces non-linear transformations, enabling the network to capture intricate relationships in data.
- Without activation functions, a network of any depth reduces to a linear model, incapable of solving complex problems.
Output Transformation
- Maps the weighted sum of inputs into a specific range (e.g., 0 to 1 for binary classification).
Gradient Propagation
- A differentiable activation function with well-behaved gradients lets gradients flow during backpropagation, which is crucial for optimizing the weights and biases in the network.

What Is the Vanishing Gradient Problem?

The vanishing gradient problem occurs in neural networks during the training process, particularly when using activation functions like the sigmoid function.

Gradient Descent: Neural networks learn by adjusting weights based on the gradients (derivatives w.r.t weights) of the loss function. These gradients indicate how much to change the weights to minimize the error.
Sigmoid Activation Function: The sigmoid squashes its input into the range 0 to 1, and its derivative is at most 0.25, approaching zero when the input is strongly positive or negative. Each sigmoid layer can therefore scale the gradient down as it is propagated back through the network.
Effect on Training: During backpropagation the gradient is multiplied by these small factors layer after layer, so it can shrink dramatically by the time it reaches the earlier layers. This leads to:
- Slow Learning: Neurons in the earlier layers learn very slowly compared to those in later layers.
- Long Training Times: The overall training process takes longer, and the model may not converge effectively.
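A rough numeric sketch of the effect: the sigmoid's derivative never exceeds 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. The snippet ignores the weights to isolate the contribution of the activation, which is a simplifying assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0
peak = sigmoid_derivative(0.0)

# Best case for a chain of sigmoid layers: every factor equals the peak value
gradient_scale = {depth: peak ** depth for depth in (1, 5, 10, 20)}

# Far from zero the sigmoid saturates and the derivative is much smaller still
saturated = sigmoid_derivative(5.0)
```

Even in this best case, the scale factor after 10 layers is below one millionth, which is why the earliest layers receive almost no learning signal.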

Common Activation Functions

Binary Step Function:

The input to the activation function is compared with a threshold. If the input is greater than the threshold, the neuron is activated; otherwise it is deactivated, meaning that its output is not passed on to the next layer.

Limitations of binary step function:

It cannot provide multi-value outputs, so it cannot be used for multi-class classification problems, for example. Its gradient is zero everywhere except at the threshold, where it is undefined, so backpropagation receives no signal for updating the weights.
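The thresholding behaviour can be sketched as follows. The default threshold of 0 and the function name are assumptions for illustration.

```python
def binary_step(z, threshold=0.0):
    """Return 1 if the input exceeds the threshold, otherwise 0."""
    return 1 if z > threshold else 0

outputs = [binary_step(z) for z in (-2.0, 0.0, 0.5, 3.0)]  # [0, 0, 1, 1]
```

The output jumps between two constant values, which is why the slope is zero on both sides of the threshold.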

Sigmoid Function

- Outputs values between 0 and 1.
- Well suited to the output layer of binary classification problems, where the output can be read as a probability.
- Prone to the vanishing gradient problem, which was a major obstacle in the development of deep neural networks.
- Because the sigmoid saturates for inputs far from zero, its gradient becomes very small there, and gradients shrink as they propagate backward through the network. This makes it harder for neurons in the early layers to learn and slows down training.

ReLU (Rectified Linear Unit)

- It outputs 0 for negative inputs and passes positive values unchanged, which produces sparse activations and is cheap to compute.
- Mitigates the vanishing gradient problem in deep networks, because its gradient is 1 for all positive inputs.

For negative inputs the gradient is zero. During backpropagation, the weights and biases of a neuron whose input stays negative are therefore not updated. This can create dead neurons that never get activated.

Because every negative input is mapped to zero, the information carried by negative values is lost, which can reduce the model's ability to fit the data properly.

Leaky ReLU: This variant multiplies negative inputs by a small constant slope (commonly 0.01) instead of setting them to zero, so the gradient on the left side of the graph is non-zero.
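A minimal sketch of both functions and their gradients follows. The slope of 0.01 for negative inputs is a commonly used default, assumed here for illustration, and the gradient at exactly zero is set to the negative-side value by convention.

```python
def relu(z):
    return max(0.0, z)

def relu_gradient(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, negative_slope=0.01):
    return z if z > 0 else negative_slope * z

def leaky_relu_gradient(z, negative_slope=0.01):
    return 1.0 if z > 0 else negative_slope

# For a negative input, ReLU passes back no gradient, while Leaky ReLU passes back a small one
relu_grad_at_negative = relu_gradient(-3.0)          # 0.0
leaky_grad_at_negative = leaky_relu_gradient(-3.0)   # 0.01
```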

Tanh (Hyperbolic Tangent)

- Similar to sigmoid but with an output range of -1 to +1, which makes it symmetric around zero.
- However, it also saturates for inputs far from zero, so it still faces the vanishing gradient problem in deep networks.
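The similarity to sigmoid can be made precise. Writing $\sigma$ for the sigmoid function (notation assumed here), tanh is a rescaled and shifted sigmoid, and its derivative shows where the gradient vanishes:

```latex
\begin{aligned}
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1 \\
\frac{d}{dx}\tanh(x) &= 1 - \tanh^{2}(x)
\end{aligned}
```

The derivative equals 1 at $x = 0$ and approaches 0 as $|x|$ grows, so saturated tanh units pass back very small gradients.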

Softmax:

Typically used in the output layer for multi-class classification tasks. It transforms the raw outputs into a probability distribution over the classes (values between 0 and 1 that sum to 1), making it easier to classify data points into categories.
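A minimal sketch of softmax is shown below. The input scores are assumed values, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift by the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))  # index of the largest score
```

The largest raw score always receives the largest probability, so the predicted class is the same as taking the maximum of the raw scores; softmax adds a probabilistic interpretation on top.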

Note:

The sigmoid and tanh functions are avoided in the hidden layers of many modern networks, since they can lead to the vanishing gradient problem.
ReLU is the most widely used activation function today, and it is important to note that it is typically used only in the hidden layers, while the output layer uses a function suited to the task (such as sigmoid or softmax).
Finally, when building a model, you can begin with ReLU and switch to other activation functions if it does not yield good performance.
