Introduction to Machine Learning

Photo by Andy Kelly on Unsplash

Target Audience

This article is for students and professionals who are interested in machine learning and are just starting out in the field. Treat it as a summary of machine learning that will help you build a basic foundation and a clear picture of the field.

What is machine learning?

Mitchell (1997) gives the technical definition of machine learning as improvement with experience at some task. A well-defined ML problem specifies:

  • An improvement over task T
  • With respect to performance measure P
  • Based on experience E

Suppose you are creating a model that plays chess. The task T is to play chess and win, the performance measure P is whether the model wins its games, and the experience E is gained from the games/iterations the model plays.

How is Machine Learning different from Traditional Programming?

In traditional programming, we provide the hardcoded logic (the program) that turns input data into output. In ML, we instead try to learn that program from the data and the expected output.
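The contrast can be sketched in a few lines of Python. This is only an illustration under assumptions: the temperature data, the 30-degree rule, and the function names are made up for the example and do not come from any real system. The first function is a rule written by hand; the second picks its own cut-off from labeled examples.

```python
# Traditional programming: the rule (a threshold) is written by hand.
def is_hot_handwritten(temp_c):
    return temp_c >= 30.0

# Machine learning: the threshold is chosen from labeled examples.
def learn_threshold(values, labels):
    """Return the cut-off that misclassifies the fewest training examples."""
    points = sorted(set(values))
    # Candidate cut-offs: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(t):
        return sum((v >= t) != y for v, y in zip(values, labels))

    return min(candidates, key=errors)

# Hypothetical training data: temperatures and whether people called them hot.
temps = [12.0, 18.0, 21.0, 24.0, 27.0, 31.0, 35.0]
hot = [False, False, False, False, True, True, True]
threshold = learn_threshold(temps, hot)

def is_hot_learned(temp_c):
    return temp_c >= threshold
```

With this data the learned cut-off lands between 24 and 27 degrees, so the learned rule disagrees with the handwritten one on a 28-degree day: the data, not the programmer, decided where the boundary goes.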

Applications of Machine Learning

You are already surrounded by machine learning: social media uses face recognition and content recommendation, e-commerce sites recommend products, and virtual personal assistants such as Alexa rely on it too. AI/ML is also used for autonomous vehicles and stock price prediction. Check out this interesting site where, each time you refresh, a machine learning model generates a new face that probably does not exist in the world:

Types of Machine Learning

There are two kinds of data: labeled and unlabeled. Labeled data has both the input and output parameters in a completely machine-readable form, but labeling it requires a lot of human labor to begin with. Unlabeled data has only one or none of these parameters in machine-readable form.

  • Supervised Learning: The machine learning algorithm is trained on labeled data. Even though the data needs to be labeled accurately for this method to work, supervised learning is extremely powerful when used in the right circumstances. (Given: training data + desired outputs aka labels).
  • Unsupervised Learning: These algorithms work with unlabeled data and discover hidden structures in it. Relationships between data points are perceived by the algorithm in an abstract manner, with no input required from human beings. (Given: training data without desired outputs).
  • Semi-Supervised Learning: An approach that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision.
  • Reinforcement Learning: Based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions and learn through trial and error.

Major Types of Tasks

  • Classification: Assigns new data to one of a set of predetermined categories. One of the most common uses of classification is filtering emails into “spam” or “non-spam.”
  • Regression: Used to predict continuous values. The goal of a regression algorithm is to fit a best-fit line or curve to the data.
  • Clustering: Involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to assign each data point to a specific group.
  • Co-training: A semi-supervised learning technique, used when there are only small amounts of labeled data and large amounts of unlabeled data.
  • Relationship discovery: Finds associations between different entities.
  • Reinforcement Learning: Concerns how intelligent agents ought to take actions in an environment in order to maximize cumulative reward.


Regression

Regression analysis is designed to show how variation in an independent variable affects the dependent variable. Example: simple linear regression, Y = bX + C, where Y is the dependent variable, X is the independent variable, b is the slope, and C is the intercept of the best-fit straight line (the regression line).
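As a rough sketch of how b and C can be obtained for Y = bX + C, the snippet below uses the standard closed-form least-squares solution. The experience/salary numbers are invented for illustration, and the function name is my own.

```python
def fit_simple_linear_regression(xs, ys):
    """Return slope b and intercept c of the least-squares line y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance_x
    c = mean_y - b * mean_x
    return b, c

# Made-up data: years of experience (X) vs. salary in thousands (Y).
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 41.0, 44.0, 52.0, 55.0]
b, c = fit_simple_linear_regression(years, salary)

# Use the fitted line to predict Y for an unseen X.
predicted_salary = b * 6.0 + c
```

Here the slope comes out at about 5.1 and the intercept at about 30.1, so each extra year in this toy data adds roughly 5.1 thousand to the predicted salary.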

Multiple linear regression: with several independent variables, we need to find the values of θ (theta) for which the error between the actual values and the predicted values is minimal.

How are weights calculated?

Weights are found using a least-squares fit. Learning objective: find values for θ such that the value output by hθ(x) for a given vector x is as close to y as possible.

The cost function J(θ), which measures this error, is convex with a single minimum (a simple example of such a function is y = x²).

The optimal weights are found using the gradient descent algorithm. Gradient descent finds a local minimum of any continuously differentiable function; because J is convex, that minimum is also the global one. Take incremental steps ‘downhill’, with the step size set by the learning rate α (alpha), until there is little or no change.
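The 'downhill' steps can be sketched as follows. This is a minimal illustration, assuming a single input variable, a mean-squared-error cost J(b, c) = (1/2m) Σ (b·x + c − y)², and hand-picked values for the learning rate and stopping tolerance; none of these specifics come from the article.

```python
def gradient_descent(xs, ys, alpha=0.05, tolerance=1e-12, max_steps=100_000):
    """Minimise J(b, c) = (1 / 2m) * sum((b*x + c - y)^2) by gradient descent."""
    m = len(xs)
    b, c = 0.0, 0.0
    for _ in range(max_steps):
        residuals = [b * x + c - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to b and c.
        grad_b = sum(r * x for r, x in zip(residuals, xs)) / m
        grad_c = sum(residuals) / m
        new_b = b - alpha * grad_b
        new_c = c - alpha * grad_c
        # Stop once a step barely changes the parameters.
        converged = abs(new_b - b) < tolerance and abs(new_c - c) < tolerance
        b, c = new_b, new_c
        if converged:
            break
    return b, c

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
b, c = gradient_descent(xs, ys)
```

If alpha is too large, the steps overshoot and the parameters diverge; if it is too small, convergence takes many more iterations. The value 0.05 was simply chosen to be safe for this small dataset.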

Polynomial Regression

By its nature, linear regression only looks at linear relationships between dependent and independent variables; that is, it assumes a straight-line relationship between them. Sometimes this assumption is incorrect. For example, the relationship between income and age is curved: income tends to rise in early adulthood, flatten out in later adulthood, and decline after people retire. When a model has to fit data with such a non-linear relationship, the polynomial regression technique is used. Here the best-fit line is not a straight line but a curve that best fits the data points. In the example below you can see how a curve is a better choice of hypothesis, which is why polynomial regression works better in this case.
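To make the difference concrete, here is a small sketch that fits both a straight line and a degree-2 polynomial to the same curved data and compares their errors. It solves the least-squares normal equations with a hand-written Gaussian elimination so that it needs no external libraries; the data and all function names are illustrative assumptions.

```python
def solve_linear_system(a, rhs):
    """Solve a @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [a | rhs], copied so the inputs are not modified.
    aug = [row[:] + [value] for row, value in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [w0, w1, ...] of y = w0 + w1*x + w2*x^2 + ..."""
    size = degree + 1
    # Normal equations: sums of x^(i+j) on the left, sums of y * x^i on the right.
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(a, rhs)

def predict(coefficients, x):
    return sum(w * x ** i for i, w in enumerate(coefficients))

def mean_squared_error(coefficients, xs, ys):
    return sum((predict(coefficients, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up curved data: rises, flattens, then declines (y = 1 + 6x - x^2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 6.0, 9.0, 10.0, 9.0, 6.0, 1.0]
line = fit_polynomial(xs, ys, degree=1)
curve = fit_polynomial(xs, ys, degree=2)
line_error = mean_squared_error(line, xs, ys)
curve_error = mean_squared_error(curve, xs, ys)
```

On this data the straight line can do no better than a flat line through the mean, while the degree-2 curve recovers the generating polynomial almost exactly, so `curve_error` is close to zero and `line_error` is not.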

Various hypotheses: which one to consider?

At times you will face a dilemma: you have to choose a single hypothesis from among several candidates. The answer is Ockham’s razor: prefer the simplest hypothesis consistent with the data, although what counts as simplest depends strongly on how a hypothesis is expressed. There is a trade-off between the complexity of a hypothesis and how well it fits the data, but prefer the simplest hypothesis that fits adequately, which for the case below is a straight line.

Understanding Bias, Variance, Underfitting, and Overfitting

Bias and variance are among the most important concepts in machine learning. Variance error is over-sensitivity to small changes in the training data. High variance means the algorithm may fit random noise in the training data rather than the real relationships in it, which leads to overfitting. Bias error comes from bad assumptions in the learning algorithm. High bias means the algorithm cannot capture all the relevant relationships between the attributes and the output, which leads to underfitting.

ML algorithms with a complex hypothesis language are prone to overfitting; those with a simpler hypothesis language are prone to underfitting. For example, linear regression uses a simple hypothesis (a straight line), so it is prone to underfitting, with high bias and low variance. Polynomial regression uses a more complex hypothesis, which can produce a heavily overfitted model, with higher variance and lower bias.

Bias/Variance tradeoff: If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find the right balance, neither overfitting nor underfitting the data.
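For squared-error loss, this trade-off is often summarised by the standard bias-variance decomposition. The notation here is an assumption and does not appear elsewhere in this article: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model learned from a random training set, and expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more flexible tends to shrink the first term and grow the second, which is why the total error is usually lowest at an intermediate level of complexity.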


Classification

Classification is the process of predicting the class of given data points. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). The decision tree is one of the most common classification algorithms. It uses a set of if-then rules that is mutually exclusive and exhaustive. The rules are learned sequentially from the training data, one at a time. A node (where branches meet) is a decision point, and nodes partition the data. The terminal nodes are called leaves; these specify the target label for an item. Splits are chosen according to a specific metric such as entropy. A random forest combines the predictions of multiple decision trees.

For the above dataset, below is the decision tree created using an algorithm that uses the entropy function to choose the split at each node of the tree.
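To show how an entropy-based split criterion can work, the sketch below computes Shannon entropy and the information gain of two candidate splits. The toy labels are hypothetical and are not the dataset referred to above; the function names are my own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy labels: whether to play outside on eight days.
parent = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
# Split A separates the classes perfectly; split B leaves both sides mixed.
split_a = [["yes", "yes", "yes", "yes"], ["no", "no", "no", "no"]]
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]
gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
```

A tree-building algorithm of this kind would prefer split A, because it removes all of the uncertainty in the labels (a gain of one bit), while split B removes none.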

K-Nearest Neighbour

Base the prediction on the k nearest neighbors. Compute the distance from the query case to all stored cases and pick the k nearest. The neighbors then vote on the classification of the test case. Either Euclidean or Manhattan distance can be used to measure the distance between two points.

1-Nearest Neighbour algorithm: the simplest similarity-based/instance-based method. Choose the single nearest neighbor and assign the test case the same label (class or regression value). To reduce susceptibility to noise, we use more than one neighbor.
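A minimal sketch of this voting scheme is shown below. The points, labels, and function names are invented for illustration; the last stored point is deliberately a noisy "B" sitting inside the "A" region, so that the effect of using more than one neighbor is visible.

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_classify(query, points, labels, k=3, distance=dist):
    """Label `query` by majority vote among its k nearest stored cases."""
    ranked = sorted(zip(points, labels), key=lambda pair: distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (6.0, 6.0), (7.0, 7.0), (6.5, 8.0),
          (2.5, 2.5)]                       # noisy "B" near the "A" cluster
labels = ["A", "A", "A", "B", "B", "B", "B"]
query = (2.4, 2.4)

label_1nn = knn_classify(query, points, labels, k=1)
label_3nn = knn_classify(query, points, labels, k=3)
label_3nn_manhattan = knn_classify(query, points, labels, k=3, distance=manhattan)
```

With k = 1 the query copies the label of the noisy point, while with k = 3 the two genuine "A" neighbors outvote it. An odd k avoids ties when there are two classes.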

Logistic Regression

Suppose we want to build a Yes/No classification model. Our goal: given a sample x, find a function f(x) that classifies it correctly. We pass the output of a linear regression model through an activation function such as the sigmoid to convert it into a logistic regression model, which is a classification model. We go from the linear regression formula hθ(x) = θᵀx to hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)) is called the sigmoid or logistic function. The output of the hypothesis is the estimated probability that the sample belongs to the positive (Yes) class. The diagram below shows how the sigmoid function changes with its input; note how it maps any input to an output between 0 and 1.
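The hypothesis can be sketched in code as follows. The weights in `theta` are hypothetical values chosen for illustration, not learned ones, and the first input feature is assumed to be a constant 1 that acts as the intercept term.

```python
from math import exp

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    # Equivalent form for negative z that avoids overflow in exp(-z).
    ez = exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta . x): estimated probability of the "Yes" class."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

theta = [-4.0, 1.0]                      # hypothetical weights
p_low = hypothesis(theta, [1.0, 1.0])    # z = -3
p_mid = hypothesis(theta, [1.0, 4.0])    # z = 0
p_high = hypothesis(theta, [1.0, 9.0])   # z = 5
```

An input with z = 0 sits exactly on the decision boundary and gets probability 0.5; a common convention is to predict "Yes" when the probability is at least 0.5.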

Also, the cost function for logistic regression is not defined using mean squared error or another regression-based error function, because we are dealing with classes. Instead we use the cross-entropy function, which for m training examples with labels y ∈ {0, 1} is defined as: J(θ) = −(1/m) Σ [ y log(hθ(x)) + (1 − y) log(1 − hθ(x)) ].
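As a small illustration of how cross-entropy behaves, the sketch below computes the mean binary cross-entropy for two sets of predicted probabilities. The labels, probabilities, and the clipping constant `eps` are assumptions made for the example.

```python
from math import log

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class
loss_good = cross_entropy(labels, mostly_right)
loss_bad = cross_entropy(labels, mostly_wrong)
```

The loss is small when the model puts high probability on the correct class and grows quickly when it is confidently wrong, which is the behaviour we want from a classification cost function.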


Clustering

Clustering is an unsupervised learning method that aims to find meaningful structures and groups in a dataset. It involves grouping data points so that, in theory, points in the same group have similar properties and/or features, while points in different groups have highly dissimilar properties and/or features.

K-Means Clustering Algorithm

K-means is the most well-known clustering algorithm. It is an iterative algorithm that divides a set of n data points into k clusters based on similarity, measured by each point's distance from the centroid (mean) of its cluster. You have to decide the value of k (the number of clusters to form) yourself. The algorithm starts with an initial set of randomly selected centroids, which serve as the starting points for the clusters, and then performs iterative calculations to optimize the positions of the centroids. It stops when the centroids have stabilized (their values no longer change).
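The loop described above can be sketched as below. This is one simple variant (often called Lloyd's algorithm), not a production implementation. The points are made up, and the starting centroids are fixed by hand rather than selected randomly so that the result is reproducible.

```python
from math import dist   # Euclidean distance (Python 3.8+)

def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stabilise."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)   # keep the centroid of an empty cluster
        if new_centroids == centroids:      # no change: centroids have stabilised
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Fixed starting centroids (an assumption made for reproducibility).
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
```

Because the outcome depends on the starting centroids, practical implementations usually run the algorithm several times from different random starts and keep the best result.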

The input data and final output after running a clustering algorithm are shown below.

Closing remarks

This article is just an introduction to machine learning, and I have not covered deep learning here. The most important takeaway is knowing the most popular kinds of problems that machine learning can solve and some common, well-known algorithms for tackling them.


