Table of contents
- Introduction to Machine Learning
- Supervised Learning
- Unsupervised Learning
Introduction to Machine Learning
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables computers to learn from data and make decisions without explicit instructions.
Key Components of ML:
Algorithms: Used to process data and learn patterns.
Feature Engineering: The process of selecting and transforming variables for model training.
Types of Learning
Supervised Learning:
Uses labeled data.
Ex: Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forest, Gradient Boosting Machines (GBM), AdaBoost, Naive Bayes.
Unsupervised Learning:
Works with unlabeled data.
Ex. K-Means Clustering, Hierarchical Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE).
Reinforcement Learning:
An agent learns to make decisions by receiving feedback from its environment.
Ex. Q-Learning, Deep Q-Networks (DQN)
Machine Learning Techniques
Classification: Predicts the class of a case (e.g., benign vs. malignant cells).
Regression: Predicts continuous values (e.g., house prices).
Clustering: Groups similar cases (e.g., customer segmentation).
Anomaly Detection: Identifies unusual cases (e.g., fraud detection).
Recommendation Systems: Suggests items based on user preferences.
ML Model Lifecycle
Problem Definition: Identify the problem to be solved by understanding client needs, such as recommending products based on purchase history.
Data Collection: Gather relevant data from various sources.
Data Preparation: Clean and transform the data to ensure quality and relevance, often through an ETL (Extract, Transform, Load) process. Exploratory Data Analysis (EDA) then identifies patterns and important features in the data.
Model Development and Evaluation: Build and assess the model's performance.
Model Deployment: Implement the model in a real-world setting.
The lifecycle is iterative, meaning you may revisit earlier steps based on evaluation results.
Python Based Libraries for ML
Supervised Learning
Linear Regression - for regression problems
A supervised learning model that predicts a continuous target variable based on explanatory features.
1. Simple Linear Regression:
Models the relationship between a continuous target variable (like CO2 emissions) and a single independent variable (like engine size).
It uses a scatter plot to visualize the correlation between the two variables, aiming to find a best-fit line.
The model is represented by the equation: y-hat = θ0 + θ1 * x1, where:
y-hat is the predicted value,
θ0 is the y-intercept,
θ1 is the slope (coefficient for the independent variable).
The goal is to minimize the Mean Squared Error (MSE), which measures how well the regression line fits the data.
This method is known as Ordinary Least Squares (OLS) Regression and is straightforward to interpret, but it can be affected by outliers.
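As a minimal sketch of the OLS fit described above, the closed-form solution for one feature can be written in plain Python. The function name fit_simple_ols and the engine-size/CO2 numbers are made up for illustration and are not taken from any dataset.

```python
def fit_simple_ols(xs, ys):
    """Return (theta0, theta1) that minimize the mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Hypothetical data: engine size vs. CO2 emissions.
engine_size = [1.0, 2.0, 3.0, 4.0]
co2 = [150.0, 200.0, 250.0, 300.0]
theta0, theta1 = fit_simple_ols(engine_size, co2)

# y-hat = theta0 + theta1 * x1 for a new engine size of 2.5
predicted = theta0 + theta1 * 2.5
```

Because this toy data lies exactly on a line, the fit recovers it without error; real data would leave residuals around the line.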
Model evaluation
You can compare the actual values and predicted values to calculate the accuracy of a regression model.
Mean Absolute Error: It is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just an average error.
Mean Squared Error (MSE): MSE is the mean of the squared errors. It is the quantity that ordinary least squares minimizes to find the best-fit line, and it equals the residual sum of squares (RSS) divided by the number of observations.
Root Mean Squared Error (RMSE): RMSE simply transforms the MSE into the same units as the variables being compared, which can make it easier to interpret.
R-squared: Not an error measure but a popular metric for how well a regression model fits. It represents the proportion of the variance in the target that the model explains, that is, how close the data points are to the fitted regression line. The higher the R-squared value, the better the model fits your data. The best possible score is 1.0, and it can be negative because a model can be arbitrarily worse than simply predicting the mean.
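The four metrics above can be computed directly from actual and predicted values. The following is a sketch with a hypothetical helper name (regression_metrics) and invented numbers, assuming the actual values are not all identical (otherwise R-squared is undefined).

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R-squared for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # same units as the target variable
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```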
2. Multiple Linear Regression
Multiple linear regression is an extension of simple linear regression that uses two or more independent variables to estimate a dependent variable.
Logistic Regression - for classification problems
Logistic regression is a statistical modeling technique used to predict the probability of an observation belonging to one of two classes (binary classification).
It is particularly useful when the target variable is binary (0 or 1) and when you need to understand the impact of independent features.
The sigmoid function is used to compress predicted values between 0 and 1, allowing for probability predictions.
A threshold (commonly 0.5) is set to classify observations into one of the two classes based on the predicted probability.
The log-loss function, also known as logistic loss or cross-entropy loss, is a cost function used in logistic regression to measure the performance of a classification model. Its purpose is to quantify how well the predicted probabilities match the actual class labels.
Measure Accuracy: Log-loss evaluates how close the predicted probabilities are to the actual outcomes. A lower log-loss indicates better model performance.
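The sigmoid, the decision threshold, and the log-loss can be sketched in a few lines. The function names are illustrative, and the clipping constant eps is an assumption added only to avoid taking log(0); this is not a numerically hardened implementation.

```python
import math


def sigmoid(z):
    """Compress any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(probability, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if probability >= threshold else 0


def log_loss(y_true, y_prob, eps=1e-15):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


confident = log_loss([1, 0], [0.9, 0.1])   # probabilities close to the labels
hesitant = log_loss([1, 0], [0.6, 0.4])    # probabilities closer to 0.5
```

Here confident is lower than hesitant, which matches the statement that a lower log-loss indicates probabilities closer to the actual outcomes.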
Decision Trees
A Decision Tree is an algorithm used for classifying data points, visualized as a flowchart.
Structure:
Internal nodes represent tests on features.
Branches represent the outcomes of these tests.
Leaf nodes assign data to classes.
Building Process:
Stopping Criteria:
Pruning: Simplifies the tree to avoid overfitting and improve predictive accuracy.
Splitting Criteria: Common measures include information gain and Gini impurity.
Entropy: Measures the disorder (class impurity) in a dataset; a node with lower entropy is purer, and an entropy of 0 means all of its samples belong to one class.
Information Gain: Information Gain quantifies the reduction in uncertainty about the class label after a dataset is split on a particular feature.
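The splitting measures above can be made concrete with a short sketch. The function names and the tiny label lists are hypothetical; entropy is computed in bits (log base 2).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["a", "a", "b", "b"]
perfect_split = information_gain(parent, [["a", "a"], ["b", "b"]])
useless_split = information_gain(parent, [["a", "b"], ["a", "b"]])
```

A split that separates the classes completely gains the full parent entropy, while a split that leaves each child as mixed as the parent gains nothing.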
Regression Tree
Regression trees are a type of decision tree used for predicting continuous values rather than discrete classes.
Difference from Classification Trees:
Creation Process:
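Regression trees are built by recursively splitting the dataset so that the target values within each resulting node are as similar as possible (the regression counterpart of maximizing information gain).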
The quality of splits is measured using Mean Squared Error (MSE), which assesses the variance of target values within nodes.
Prediction at Leaf Nodes:
- For regression trees, predictions are made using the average of the target values in the node.
Choosing Thresholds:
- Continuous features can be split using various strategies, including sorting values and selecting midpoints between them.
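A sketch of how one split of a regression tree might be chosen: sort the feature values, try the midpoint between each pair of neighbors, and keep the threshold with the lowest size-weighted MSE. The names mse and best_split and the sample data are illustrative assumptions.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint candidate
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


threshold, score = best_split(
    [1, 2, 3, 10, 11, 12],
    [5.0, 6.0, 7.0, 20.0, 21.0, 22.0],
)
```

For this data the chosen threshold falls between 3 and 10, and each leaf would then predict the average target of its side (6.0 and 21.0).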
Support Vector Machine (SVM)
Support Vector Machines (SVMs) are a type of supervised learning algorithm used primarily for classification tasks, but they can also be adapted for regression.
Classification: They classify data by finding a hyperplane that maximizes the margin between different classes.
Support Vectors: The closest data points to the hyperplane, known as support vectors, are crucial for defining the hyperplane's position.
Soft Margin: SVMs can allow some misclassifications through a soft margin, controlled by the parameter C, balancing margin maximization and misclassification minimization.
Kernel Trick: SVMs can use various kernel functions (linear, polynomial, RBF) to transform data into higher dimensions for better separation.
Epsilon: In Support Vector Regression (SVR), epsilon (ε) is a parameter that defines a margin of tolerance around the predicted values; errors that fall inside this margin are not penalized.
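The roles of C, the kernel, and epsilon can be illustrated with the loss functions commonly associated with them. This is a sketch of the standard textbook formulas, not a trainable SVM; the function names and the labels in {-1, +1} are assumptions made for the example.

```python
import math


def hinge_loss(y, score):
    """Penalty for one sample; y is -1 or +1, score is w.x + b."""
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 (margin term) plus C times the total hinge loss."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack = 0.0
    for x, yi in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack += hinge_loss(yi, score)
    return margin_term + C * slack


def rbf_kernel(a, b, gamma=1.0):
    """RBF similarity: 1.0 for identical points, decaying with distance."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR loss: errors smaller than epsilon cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# One point outside the margin (no penalty), one inside it (penalized).
objective = soft_margin_objective(
    [1.0, 0.0], 0.0, [[2.0, 0.0], [-0.5, 0.0]], [1, -1], C=1.0
)
```

Raising C makes margin violations more expensive, which pushes the solution toward fewer misclassifications at the cost of a narrower margin.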
K-Nearest Neighbors (KNN)
KNN is a supervised machine learning algorithm used for classification and regression tasks.
How It Works: It classifies a new data point based on the majority class of its K nearest neighbors from the training data; for regression, it averages the target values of those neighbors instead.
Choosing K: The value of K (number of neighbors) is crucial; a small K can lead to overfitting, while a large K can cause underfitting.
Distance Measurement: KNN calculates the distance between data points to determine which neighbors are closest.
Applications: Commonly used in tasks like classifying flowers, identifying handwritten digits, and more.
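A minimal KNN classifier, assuming Euclidean distance and a simple majority vote (the function name knn_predict and the toy points are illustrative):

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(train_X, train_y, (2, 2))
near_second_group = knn_predict(train_X, train_y, (7, 7))
```

With k=1 the prediction follows the single closest point and is sensitive to noise, while a very large k would let distant points outvote the local neighborhood.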
Bias-Variance Tradeoff
The bias-variance tradeoff is a key concept in machine learning that describes the balance between two types of errors that affect the performance of a model:
Bias: This error occurs when a model is too simple to capture the underlying patterns in the data. High bias can lead to underfitting, where the model performs poorly on both training and unseen data because it fails to learn the relevant features.
Variance: This error arises when a model is too complex and captures noise in the training data rather than the actual signal. High variance can lead to overfitting, where the model performs well on training data but poorly on unseen data because it is too tailored to the training set.
Tradeoff:
As you increase model complexity (e.g., using more features or a more complex algorithm), bias decreases (the model fits the training data better), but variance increases (the model becomes more sensitive to fluctuations in the training data).
Conversely, simplifying the model increases bias (it may not fit the training data well) but decreases variance (the model is more stable across different datasets).
The goal is to find the right level of complexity that minimizes the total error, balancing bias and variance to achieve good performance on both training and unseen data.
Solutions:
Model Selection: Choose an appropriate model complexity based on the data. Simpler models (like linear regression) may work well for less complex data, while more complex models (like decision trees or neural networks) may be needed for intricate patterns.
Regularization: Techniques like Lasso (L1 regularization) and Ridge (L2 regularization) can help reduce overfitting by adding a penalty for larger coefficients in the model, effectively simplifying it.
Cross-Validation: Use cross-validation techniques to evaluate model performance on different subsets of the data. This helps in selecting a model that generalizes well to unseen data.
Ensemble Methods: Techniques like bagging (e.g., Random Forests) and boosting (e.g., Gradient Boosting) combine multiple models to improve overall performance. Bagging reduces variance, while boosting reduces bias.
Feature Engineering: Carefully selecting and transforming features can help improve model performance. Adding relevant features can reduce bias, while removing irrelevant ones can help reduce variance.
Hyperparameter Tuning: Adjusting hyperparameters (like tree depth in decision trees) can help find the right balance between bias and variance. Techniques like grid search or random search can be used for this purpose.
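To make the cross-validation item above concrete, here is a sketch of how k-fold splits could be generated. The helper name k_fold_indices is hypothetical, and the sketch does not shuffle the data, which a practical setup would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one validation fold, so averaging a metric over the folds uses every sample once for evaluation.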
Bagging And Boosting
Bagging and boosting are two popular ensemble methods used in machine learning to improve model performance:
1. Bagging (Bootstrap Aggregating)
Concept: Bagging involves training multiple models (often the same type) on different subsets of the training data and then combining their predictions.
How It Works:
Randomly sample subsets of the training data with replacement (this is called bootstrapping).
Train a separate model on each subset.
Combine the predictions of all models, typically by averaging (for regression) or voting (for classification).
Goal: Bagging aims to reduce variance and prevent overfitting. By averaging the predictions of multiple models, it smooths out the noise and leads to more stable predictions.
Example: Random Forest is a well-known bagging method that uses multiple decision trees trained on bootstrapped datasets.
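The bagging steps above can be sketched as follows. The base model here is a deliberately simple, hypothetical nearest-class-mean classifier on one feature, standing in for the decision trees a Random Forest would use; all names and data are illustrative.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_nearest_mean(rows):
    """Hypothetical base model: predict the class whose mean x is closest."""
    by_class = {}
    for x, label in rows:
        by_class.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in by_class.items()}

    def predict(x):
        return min(means, key=lambda label: abs(x - means[label]))

    return predict


def bagging_fit(rows, n_models=15, seed=0):
    """Train one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_nearest_mean(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rows = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]
models = bagging_fit(rows)
low = bagging_predict(models, 1.2)
high = bagging_predict(models, 8.8)
```

For regression, the vote in bagging_predict would be replaced by an average of the models' outputs.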
2. Boosting
Concept: Boosting involves training multiple models sequentially, where each new model focuses on correcting the errors made by the previous ones.
How It Works:
Train the first model on the entire dataset.
Evaluate its predictions and assign higher weights to the misclassified instances.
Train the next model on the updated dataset, which emphasizes the errors of the previous model.
Continue this process for a specified number of iterations or until performance plateaus.
Combine the predictions of all models, often using a weighted sum.
Goal: Boosting aims to reduce bias and improve the model's accuracy by focusing on difficult-to-predict instances.
Example: Popular boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
Different Boosting Algorithms
AdaBoost (Adaptive Boosting)
Concept: AdaBoost combines multiple weak learners (models that perform slightly better than random guessing) to create a strong learner.
How It Works:
Initially, all training instances are given equal weights.
A weak learner (often a decision tree with limited depth) is trained on the dataset.
The algorithm evaluates the weak learner's performance and increases the weights of misclassified instances.
A new weak learner is trained on the updated dataset, focusing on the previously misclassified instances.
This process is repeated for a specified number of iterations, and the final model is a weighted sum of all weak learners.
Goal: Improve accuracy by focusing on difficult-to-classify instances.
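One round of the weight update can be sketched with the classic binary AdaBoost formulas, assuming labels in {-1, +1} and weights that sum to 1. The function name adaboost_round is illustrative, and the clipping of the error is an assumption added to avoid division by zero.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Return (alpha, new_weights) after evaluating one weak learner."""
    # Weighted error: total weight of the misclassified instances.
    error = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    error = min(max(error, 1e-10), 1.0 - 1e-10)
    # Say of this learner in the final weighted sum.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified instances (yt * yp == -1) are scaled up, the rest down.
    updated = [
        w * math.exp(-alpha * yt * yp)
        for w, yt, yp in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25] * 4                       # equal initial weights
alpha, new_weights = adaboost_round(
    weights, [1, 1, -1, -1], [1, -1, -1, -1]  # the second instance is wrong
)
```

After the update, the single misclassified instance carries half of the total weight, so the next weak learner is pushed to get it right.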
Gradient Boosting
Concept: Gradient Boosting builds models sequentially, where each new model corrects the errors of the previous one by minimizing a loss function.
How It Works:
Start with an initial model (often a simple one).
Calculate the residuals (errors) of the predictions from the current model.
Train a new model to predict these residuals.
Update the current model by adding the new model's predictions, scaled by a learning rate.
Repeat this process for a specified number of iterations or until performance stops improving.
Goal: Minimize the loss function and improve model accuracy by iteratively correcting errors.
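The residual-fitting loop above can be sketched for squared-error loss on a single feature. The weak learner is a hypothetical one-threshold regression stump, the initial model is the mean of the targets, and the sketch assumes the feature has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Best single-threshold stump (by squared error) for the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so 'right' is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predict(x) function built from sequentially fitted stumps."""
    base = sum(ys) / len(ys)          # initial model: predict the mean
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new model's prediction, scaled by the learning rate.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
predict = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

A smaller learning rate makes each correction more cautious and usually needs more rounds to reach the same training error.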
XGBoost (Extreme Gradient Boosting)
Concept: XGBoost is an optimized implementation of gradient boosting that is designed for speed and performance.
How It Works: Similar to gradient boosting, but with enhancements such as:
Regularization: Helps prevent overfitting by adding penalties for complex models.
Parallel Processing: Utilizes multiple cores for faster computation.
Tree Pruning: Uses a more efficient algorithm for tree construction and pruning.
Handling Missing Values: Automatically learns how to handle missing data during training.
Goal: Provide a highly efficient and scalable boosting algorithm that performs well on large datasets.
Unsupervised Learning
Clustering
Clustering is a machine learning technique that involves grouping a set of data points into clusters based on their similarities.
Unsupervised Learning: Clustering is an unsupervised learning method, meaning it works with unlabeled data and identifies patterns without prior knowledge of the groupings.
Purpose: The main goal is to find natural groupings in the data, which can help in understanding the structure of the data and making informed decisions.
Applications: Used in customer segmentation, anomaly detection, and data summarization.
Types of Clustering Methods:
1. Partition-based Clustering:
Partition-based clustering is a type of clustering method that divides a dataset into distinct, non-overlapping groups or clusters.
Non-overlapping Groups: Each data point belongs to exactly one cluster.
Centroid-Based: Clusters are often represented by a central point (centroid), which is the mean of all points in the cluster.
Minimizing Variance: The goal is to minimize the variance within each cluster while maximizing the variance between clusters.
Ex. k-means clustering
2. Density-based Clustering:
Density-based clustering is a clustering method that groups data points based on the density of data points in a region.
Cluster Formation: Clusters are formed in areas of high density separated by areas of low density.
Arbitrary Shapes: Unlike partition-based methods, density-based clustering can identify clusters of various shapes and sizes.
Noise Handling: It can effectively identify and handle noise or outliers in the dataset.
Ex. DBSCAN
3. Hierarchical Clustering:
Hierarchical clustering is a clustering method that builds a hierarchy of clusters, creating a tree-like structure called a dendrogram.
Agglomerative: A bottom-up approach where each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: A top-down approach where all data points start in one cluster, and the cluster is recursively split into smaller clusters.
K-Means Clustering
K-Means is an iterative, centroid-based clustering algorithm that partitions a dataset into k non-overlapping clusters based on the distance between each data point and the cluster centroids.
Centroid: The centroid is the average position of all points in a cluster, and data points nearest to a centroid are grouped together.
Choosing K: A higher k value results in smaller clusters with more detail, while a lower k value leads to larger clusters with less detail.
Algorithm Steps:
Initialization:
Choose the number of clusters, k.
Randomly select k initial centroids (the center points of the clusters).
Assignment Step:
For each data point, calculate the distance to each centroid.
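Euclidean distance: d(x, c) = sqrt((x1 - c1)^2 + (x2 - c2)^2 + ... + (xn - cn)^2), where x is the data point and c is the centroid.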
Assign each data point to the cluster with the nearest centroid.
Update Step:
- Recalculate the centroids by taking the mean of all data points assigned to each cluster.
Iteration:
- Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.
- Observe how centroids changed in each iteration.
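The algorithm steps above can be sketched in plain Python. This is a minimal illustration, assuming points are given as a list of tuples, Euclidean distance, and random initial centroids drawn from the data; the function name k_means and the sample points are made up.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0, tol=1e-9):
    """Return (centroids, clusters) after alternating assignment and update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # initialization
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:                          # centroids stopped moving
            break
    return centroids, clusters


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = k_means(points, k=2)
```

On data with less clearly separated groups, different seeds can give different results, which is the local-minimum sensitivity that comes with random initialization.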
Drawbacks of K-means:
Imbalanced Clusters: K-Means assumes that clusters are of similar size. If one cluster has significantly more points than another, the algorithm may struggle to accurately represent the smaller cluster.
Sensitivity to Outliers: Outliers can significantly affect the position of centroids, leading to poor clustering results. K-Means is sensitive to noise and outliers.
Choosing the Wrong K: If the number of clusters (k) is not chosen appropriately, it can lead to either overfitting (too many clusters) or underfitting (too few clusters), resulting in poor clustering performance.
Convergence to Local Minima: K-Means may converge to a local minimum rather than the global minimum, depending on the initial placement of centroids. Different initializations can lead to different clustering results.
DBSCAN and HDBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Creates clusters based on a user-defined density value.
Grows each cluster outward from densely packed points rather than around a centroid.
Identifies core points (with enough neighbors), border points (near core points but not dense enough), and noise points (isolated).
Can discover clusters of any shape and size, making it effective for datasets with noise or unknown cluster numbers.
Distinguishes between data points that are part of a cluster and noise, which is useful for data with outliers or when the number of clusters is unknown.
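The three point types can be illustrated with a short sketch. It only labels points as core, border, or noise; a full DBSCAN would additionally chain neighboring core points into clusters. The parameter names eps (neighborhood radius) and min_samples (density threshold, counting the point itself) are assumptions chosen for the example.

```python
import math


def label_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise'."""
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")    # isolated
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
labels = label_points(points, eps=1.5, min_samples=4)
```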
HDBSCAN (Hierarchical DBSCAN):
A variant of DBSCAN that needs far less parameter tuning; in particular, it does not require a fixed neighborhood radius.
Uses cluster stability to adaptively adjust neighborhood sizes, resulting in more meaningful clusters.
Combines agglomerative and density-based clustering, creating a hierarchy of clusters.
Dimension Reduction
Dimension reduction plays a crucial role in clustering by:
Simplifying Data Structure: It reduces the number of features in high-dimensional data, making it easier to analyze and visualize.
Improving Clustering Efficiency: High-dimensional data can lead to sparsity, making it difficult for clustering algorithms to find meaningful patterns. Dimension reduction helps mitigate this issue.
Enhancing Visualization: By projecting high-dimensional data into two or three dimensions, it allows for better visual interpretation of clustering results, such as creating scatter plots to assess cluster quality.
Overall, dimension reduction serves as a preprocessing step that enhances the effectiveness of clustering algorithms.
Clustering, Dimension Reduction, and Feature Engineering
Clustering: A technique used to group similar data points, aiding in feature selection and creation.
Dimension Reduction: Simplifies data structures, making it easier to visualize and process high-dimensional data. Techniques like PCA, t-SNE, and UMAP are commonly used.
Feature Engineering: Involves selecting and transforming features to improve model performance. Clustering can help identify redundant features, allowing for more efficient modeling.
Dimension Reduction Algorithms
These algorithms reduce the number of features in a dataset while preserving essential information, making it easier to analyze and visualize high-dimensional data.
PCA (Principal Component Analysis): A linear method that transforms features into uncorrelated variables (principal components) while retaining variance.
t-SNE (t-Distributed Stochastic Neighbor Embedding): A nonlinear method that maps high-dimensional data to lower dimensions, focusing on preserving local similarities.
UMAP (Uniform Manifold Approximation and Projection): Another nonlinear method that constructs a high-dimensional graph and optimizes a low-dimensional representation. It often scales better than t-SNE and tends to preserve more of the global structure, which can help when the result is used for clustering.
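Of the three methods, PCA is the simplest to sketch. The following finds only the first principal component, using power iteration on the covariance matrix; that choice of method, the function name, and the toy data are assumptions for illustration, and the sketch assumes the starting vector is not orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, n_iters=200):
    """Return (direction, projections) for the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # starting vector
    for _ in range(n_iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Coordinates of each point along the component (the 1-D representation).
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Points lying on the line y = 2x: one direction holds all the variance.
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, projections = first_principal_component(rows)
```

Here the two original features collapse to a single coordinate per point without losing information, because the data varies along one direction only.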
This article provides an introduction to Machine Learning (ML), covering key components like algorithms and feature engineering. It explores different types of learning, including supervised, unsupervised, and reinforcement learning, and outlines various ML techniques such as classification, regression, clustering, anomaly detection, and recommendation systems. The ML model lifecycle is detailed, from problem definition to model deployment. The article also examines supervised learning algorithms like linear and logistic regression, decision trees, SVMs, and KNN, as well as ensemble methods like bagging and boosting. Unsupervised learning methods, such as k-means and DBSCAN clustering, are discussed alongside dimension reduction techniques for simplifying data. Concepts like bias-variance tradeoff and the importance of feature engineering and hyperparameter tuning are also highlighted.