Introduction to Deep Learning and Neural Networks

What is Deep Learning?

Deep learning is a subset of machine learning that leverages neural networks with multiple layers (hence "deep") to analyze and understand complex data. Inspired by how the human brain processes information, it enables computers to learn from large datasets without explicit programming for every task.

Why Deep Learning Matters

  1. Solving Complex Problems:
    Deep learning excels in tasks like image recognition, speech processing, and natural language understanding—areas that traditional algorithms struggle to handle.

  2. Automation:
    By automating feature extraction, deep learning reduces the need for manual intervention in data processing, speeding up workflows.

  3. Scalability:
    It efficiently handles massive datasets, making it ideal for big data applications across industries.

  4. Real-World Applications:

    • Healthcare: Assisting in diagnosis and medical imaging.

    • Finance: Enhancing fraud detection systems.

    • Autonomous Vehicles: Enabling navigation and object detection.

Recent Advancements in Deep Learning

Deep learning continues to advance, transforming diverse fields:

  • Color Restoration: Neural networks restore colors to grayscale images with remarkable accuracy.

  • Speech Enactment: Audio clips are synced with video, demonstrated with synthesized videos of public figures like Barack Obama.

  • Handwriting Generation: Algorithms create realistic cursive handwriting in various styles.

  • Other Innovations: Machine translation, adding sound to silent movies, object classification, and self-driving cars.

Neurons and Neural Networks

Deep learning revolves around neurons and their interconnected networks. Let’s break this down, starting from biological neurons to their artificial counterparts.

Biological Neurons

  • Neurons are the brain's basic units, consisting of:

    • Soma: Main body of the neuron.

    • Dendrites: Branch-like structures that receive signals.

    • Axon: A long arm that sends messages to other neurons.

  • Learning occurs through strengthened connections between neurons based on repeated activation.

Artificial Neurons

Artificial neurons mimic biological ones:

  • They process input data (like dendrites), compute a weighted sum (soma), and pass it to other neurons (axon).

  • Learning happens by adjusting weights and biases, improving the network’s accuracy over time.

Mathematical Formulation of Neural Networks

Neural networks are structured to resemble biological neurons, with each artificial neuron functioning as a computational unit. These neurons are organized into layers:

  1. Input Layer: Accepts raw data as input.

  2. Hidden Layers: Perform computations and extract features.

  3. Output Layer: Produces the final result or prediction.

Each neuron in the network processes data mathematically:

  • Inputs (x₁, x₂, …, xₙ): Data fed into the network.

  • Weights (w₁, w₂, …, wₙ): Adjust the importance of each input.

  • Bias (b): Adds flexibility to the model.

The neuron computes a weighted sum of the inputs:

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b = ∑(wᵢ⋅xᵢ) + b

To make this output useful for complex tasks, it is passed through an activation function (e.g., sigmoid, ReLU), giving the neuron’s final output:

a = f(z)

Weights:

  • Each connection between neurons has an associated weight that adjusts the strength of the input signal.

  • Higher weights increase the influence of the corresponding input on the neuron's output, while lower weights reduce it.

Bias:

  • The bias is an additional parameter added to the weighted sum, allowing the model to shift the activation function.

  • It helps the model fit the data better by providing flexibility in the output, even when all inputs are zero.

Together, weights and biases enable the neural network to learn complex patterns in the data by adjusting these parameters during training.
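To make this concrete, here is a minimal sketch of a single artificial neuron in Python; the variable names, input values, and the choice of a sigmoid activation are illustrative, not tied to any particular framework:

```python
import numpy as np

def sigmoid(z):
    # Squash the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # Weighted sum of inputs plus bias: z = sum(w_i * x_i) + b
    z = np.dot(w, x) + b
    # The activation function produces the neuron's final output
    return sigmoid(z)

# Example: three inputs with arbitrary weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2
print(neuron_output(x, w, b))
```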

Forward Propagation

Forward propagation is the process of passing input data through a neural network to generate an output. Here's how it works:

  1. Input Layer: Accepts input features.

  2. Weights and Biases: Each input is multiplied by a weight, and a bias is added.

  3. Linear Combination: Computes z=∑(w⋅x)+b.

  4. Activation Function: Applies a non-linear function (e.g., ReLU, Sigmoid) to z to produce the neuron’s output.

  5. Hidden Layers: The output of one layer serves as the input for the next.

  6. Output Layer: Produces the final prediction.

Forward propagation calculates the output in a step-by-step manner, preparing for backpropagation during training.
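As a rough illustration, the sketch below runs forward propagation through a tiny network with one hidden layer; the layer sizes, random weights, and the ReLU/sigmoid pairing are arbitrary choices made for the example:

```python
import numpy as np

def relu(z):
    # ReLU: pass positive values through, zero out negatives
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: linear combination z = W x + b, then a non-linear activation
    z1 = W1 @ x + b1
    a1 = relu(z1)
    # Output layer: the hidden layer's output becomes the next layer's input
    z2 = W2 @ a1 + b2
    return sigmoid(z2)

# Toy example: 3 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```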

Gradient Descent

Gradient Descent is like a treasure hunt to find the lowest point in a hilly terrain. Imagine standing on a hill and wanting to reach the valley below. To do this, you look for the steepest downward slope and take a step in that direction. You repeat this process, checking the slope and stepping down each time, until you reach the valley—the point with the lowest cost.

In data science, gradient descent is an algorithm that helps us determine the best value for a parameter (like "w") to fit a line to data points. The goal is to minimize the gap between the actual data points and the line. Just as you adjust your steps on a hill based on its steepness, gradient descent adjusts "w" using the slope of the cost function.

Process:

  1. Start with an initial guess for "w."

  2. Compute the slope (gradient) of the cost function.

  3. Take a step to reduce the cost, controlled by a parameter called the learning rate.

  4. Repeat until the minimum cost is achieved or nearly reached.
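A minimal sketch of this loop, fitting a single parameter w so that the line y = w·x matches some data points; the data, starting guess, and learning rate are made up for illustration:

```python
import numpy as np

# Toy data that roughly follows y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

w = 0.0              # 1. initial guess for w
learning_rate = 0.01

for step in range(1000):
    y_pred = w * x
    # 2. slope (gradient) of the MSE cost with respect to w
    gradient = (2.0 / len(x)) * np.sum((y_pred - y) * x)
    # 3. step in the direction that reduces the cost
    w -= learning_rate * gradient

# 4. after enough iterations, w settles near the minimum (close to 2 here)
print(w)
```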

Cost Function

The cost function in gradient descent quantifies the error between a model's predictions and the actual data. A common cost function is Mean Squared Error (MSE), which is calculated as:

MSE = (1/n) ∑(yᵢ − ŷᵢ)²

Where:

  • n is the number of data points,

  • yᵢ is the actual value for the i-th data point,

  • ŷᵢ is the predicted value for the i-th data point.

The goal of gradient descent is to minimize this cost function, adjusting model parameters (like weights) to reduce the error between predictions and actual values. The function typically has a parabolic shape with a single global minimum, which makes it easier to optimize.

The graph below shows how the fitted line changes with the value of w; as J(w) decreases, the line aligns more closely with the data points.
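For reference, the MSE formula translates directly into a few lines of code (the array names and values are just for the example):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([2.1, 3.9, 6.2, 8.0])
y_pred = np.array([2.0, 4.0, 6.0, 8.0])
print(mean_squared_error(y_true, y_pred))
```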

Learning Rate

The learning rate is a key parameter in the gradient descent algorithm that dictates the size of the steps taken towards minimizing the cost function. It controls how much the model's parameters (such as weights) are adjusted based on the calculated gradient.

Step Size:

  • A larger learning rate leads to bigger steps, speeding up convergence but possibly missing the minimum and causing the algorithm to diverge.

  • A smaller learning rate results in smaller steps, allowing more precise convergence but making the process slower.

Typical Values:
Learning rates typically range from 0 to 1, with common values like 0.01, 0.001, or 0.1.
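The learning rate appears as the scaling factor in each parameter update; the values below are just typical examples to show how it changes the step size:

```python
# One gradient descent update: new_w = w - learning_rate * gradient
w, gradient = 0.5, 4.0

for learning_rate in (0.1, 0.01, 0.001):
    step = learning_rate * gradient
    print(f"lr={learning_rate}: step size {step}, new w = {w - step}")
```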

Backpropagation

Backpropagation is a key algorithm used in training neural networks. It is a method for optimizing the weights of the network by minimizing the error in the predictions.

  1. Forward Pass: The input is passed through the network to make a prediction (output).

  2. Error Calculation: The difference (error) between the predicted output and the actual output is calculated using a loss function.

  3. Backward Pass: The error is then propagated back through the network, starting from the output layer to the input layer, calculating the gradient of the error with respect to each weight.

  4. Weight Update: The weights are adjusted using the calculated gradients to minimize the error, typically through the gradient descent algorithm.

This process repeats iteratively, helping the model learn by gradually improving the weights to reduce the prediction error.
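A hedged sketch of these four steps for a single sigmoid neuron trained on one example; the data, initial weights, squared-error loss, and learning rate are all arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training example and initial parameters
x = np.array([0.5, -1.0])
y_true = 1.0
w = np.array([0.1, 0.2])
b = 0.0
learning_rate = 0.1

for step in range(100):
    # 1. Forward pass: prediction from the current weights
    z = np.dot(w, x) + b
    y_pred = sigmoid(z)

    # 2. Error calculation with a squared-error loss
    loss = (y_pred - y_true) ** 2

    # 3. Backward pass: chain rule from the loss back to w and b
    dloss_dypred = 2 * (y_pred - y_true)
    dypred_dz = y_pred * (1 - y_pred)   # derivative of the sigmoid
    grad_w = dloss_dypred * dypred_dz * x
    grad_b = dloss_dypred * dypred_dz

    # 4. Weight update via gradient descent
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(loss, y_pred)
```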

Activation Functions in Neural Networks

Activation functions are crucial components in neural networks that introduce non-linearity, enabling them to learn and model complex patterns. Here's a closer look at their purpose, types, and impact.

Purpose of Activation Functions

  1. Decision Making

    • Determines whether a neuron should be activated.

    • Helps the network decide which features are important for prediction.

  2. Non-Linearity

    • Introduces non-linear transformations, enabling the network to capture intricate relationships in data.

    • Without activation functions, a network is essentially a linear regression model, incapable of solving complex problems.

  3. Output Transformation

    • Maps the weighted sum of inputs into a specific range (e.g., 0 to 1 for binary classification).

  4. Gradient Propagation

    • Ensures gradients flow during backpropagation, crucial for optimizing weights and biases in the network.

Common Activation Functions

  • Sigmoid Function

    • Outputs values between 0 and 1.

    • Ideal for binary classification problems.

    • The vanishing gradient problem was a major issue with the sigmoid activation function, which hindered the development of neural networks.

    • The sigmoid function compresses its output to a range between 0 and 1, and when used in backpropagation, the gradients become smaller as they propagate backward through the network. This makes it harder for neurons in the early layers to learn and slows down the training process.

  • ReLU (Rectified Linear Unit)

    • It outputs 0 for negative inputs and passes positive values unchanged, which makes it sparse and more efficient.

    • Resolves the vanishing gradient problem in deep networks.

  • Tanh (Hyperbolic Tangent)

    • Similar to sigmoid but with an output range of -1 to +1, which makes it symmetric around zero.

    • However, it still faces the vanishing gradient problem in deep networks.

  • Softmax: Typically used in the output layer for classification tasks. It transforms the output into a probability distribution, making it easier to classify data points into categories.
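The four functions described above can be written directly in a few lines; this is a plain NumPy sketch, not tied to any particular framework:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, z)

def tanh(z):
    # Like sigmoid but symmetric around zero, with range (-1, 1)
    return np.tanh(z)

def softmax(z):
    # Turns a vector of scores into a probability distribution
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), tanh(z), softmax(z))
```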

Note:

  • The sigmoid and tanh functions are avoided in many applications nowadays since they can lead to the vanishing gradient problem.

  • The ReLU function is the most widely used nowadays; note that it is used only in the hidden layers.

  • Finally, when building a model, you can begin by using the ReLU function and switch to other activation functions if ReLU does not yield good performance.