Feature selection is one of the most important parts of machine learning. Most real-world datasets contain many features, but not all of them are necessary for a given machine learning algorithm. Using too many unnecessary features can cause several problems. The first is computation cost: an unnecessarily large dataset takes an unnecessarily long time to run through the algorithm. It can also cause overfitting, which is never desirable.
There are several feature selection methods out there. I will demonstrate four popular feature…
Stochastic gradient descent is a widely used approach in machine learning and deep learning. This article explains stochastic gradient descent with a single perceptron on the famous iris dataset. I am assuming that you already know the basics of gradient descent. If you need a refresher, please check out this linear regression tutorial, which explains gradient descent with a simple machine learning problem.
Before diving into stochastic gradient descent, let’s have an overview of regular gradient descent. Gradient descent is an iterative algorithm: it starts from initial weights and repeatedly adjusts them in the direction that reduces the error. Let’s put it in a simple example. As I mentioned, I will use a single perceptron:
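As a minimal sketch of how the two update rules differ for a single perceptron, the code below trains the same sigmoid unit twice: once with regular (batch) gradient descent, which makes one update per pass from the gradient averaged over all rows, and once with stochastic gradient descent, which makes one update per row. The toy data, the learning rate, the number of passes, and the function names are all assumptions made for illustration; this is not the iris dataset and not necessarily the setup used in the article.

```python
import math
import random

def predict(weights, bias, x):
    """Sigmoid output of a single perceptron for one input vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def batch_step(weights, bias, X, y, lr):
    """Regular gradient descent: one update from the gradient averaged over all rows."""
    n = len(X)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict(weights, bias, xi) - yi  # gradient of log loss w.r.t. z
        for j, xij in enumerate(xi):
            grad_w[j] += err * xij / n
        grad_b += err / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b

def sgd_epoch(weights, bias, X, y, lr, rng):
    """Stochastic gradient descent: one update per row, rows visited in random order."""
    order = list(range(len(X)))
    rng.shuffle(order)
    for i in order:
        err = predict(weights, bias, X[i]) - y[i]
        weights = [w - lr * err * xij for w, xij in zip(weights, X[i])]
        bias -= lr * err
    return weights, bias

# Hypothetical, already-centered toy data (two features, two classes).
X = [[-2.0, -0.7], [-1.5, -0.6], [1.5, 0.6], [2.0, 0.9]]
y = [0, 0, 1, 1]

rng = random.Random(0)
w_gd, b_gd = [0.0, 0.0], 0.0
w_sgd, b_sgd = [0.0, 0.0], 0.0
for _ in range(200):
    w_gd, b_gd = batch_step(w_gd, b_gd, X, y, lr=0.1)
    w_sgd, b_sgd = sgd_epoch(w_sgd, b_sgd, X, y, lr=0.1, rng=rng)

gd_labels = [round(predict(w_gd, b_gd, xi)) for xi in X]
sgd_labels = [round(predict(w_sgd, b_sgd, xi)) for xi in X]
print("batch GD labels:", gd_labels)
print("SGD labels:     ", sgd_labels)
```

Both loops see the same rows the same number of times. The difference is that the stochastic version changes the weights after every row, so it makes four updates per pass here instead of one.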
Pandas is a very popular Python library for data analysis, and it has a large number of functions. This article discusses three very useful and widely used functions for summarizing data. I explain them with examples so we can use them to their full potential.
The three functions I am talking about today are count, value_counts, and crosstab.
The count function is the simplest. The value_counts function can do a bit more, and the crosstab function does even more complicated work with simple commands.
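To make the difference between the three concrete, here is a small sketch on a made-up table. The column names imitate the Titanic dataset, but the six rows are hypothetical and exist only for illustration. In pandas, the calls are `DataFrame.count`, `Series.value_counts`, and `pandas.crosstab`.

```python
import pandas as pd

# Hypothetical mini-table with Titanic-style column names; not the real dataset.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0, None, 27.0],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# count: number of non-null entries in each column.
non_null = df.count()

# value_counts: how often each distinct value occurs in one column.
class_counts = df["Pclass"].value_counts()

# crosstab: frequency table of one column against another.
table = pd.crosstab(df["Sex"], df["Survived"])

print(non_null)
print(class_counts)
print(table)
```

In this sketch, count reports fewer entries for Age than for the other columns because two ages are missing, value_counts summarizes a single column, and crosstab relates two columns at once.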
The famous Titanic dataset is used for this demonstration. …
Exploratory data analysis is essential for understanding any dataset. It includes data summarization, visualization, some statistical analysis, and predictive analysis. This article will focus on data storytelling, or exploratory data analysis, using R and several R packages.
This article will cover:
2. Some Basic Statistics
3. Predictive Model
If you are a regular follower of my articles, you might have seen an earlier exploratory data analysis project in Python that uses the same dataset. Here is the link:
I am using the same dataset here for performing an exploratory data analysis in…
This is a very common question, especially for beginners: where to start? Even for intermediate-level data scientists, this can be a question, because different people have different preferences and styles of work. Some companies prefer Python and some prefer R. I have friends who learned Python first and were then told by recruiters or employers that they should learn R, so now they are learning R. So which one is actually better?
I started with Python. When I started my MS at Boston University, I had to learn R, because some of the data analytics courses use R…
Text data analysis is becoming easier every day. Prominent programming languages like Python and R have great libraries for it. There was a time when people thought you needed to be an expert coder to do this kind of complex task. But as the libraries have matured and improved, text data analysis can now be done with simple, beginner-level coding knowledge.
It is very important for data scientists to learn statistics well. Learning visualization and data manipulation tools is great! But without knowledge of statistics, it is not possible to infer real information from the data.
I have written several tutorials on different inferential statistics topics, and I realized that combining them would make a nice course for learners. Every article except the first works through a project with a dataset, and learning by doing a project is a great way to learn.
At the same time, you will find some…
With the increasing number of text documents, text document classification has become an important task in data science. At the same time, machine learning and data mining techniques are improving every day. Both Python and R have excellent functionality for text data cleaning and classification.
This article will focus on text document processing and classification using R libraries.
The data used here is a collection of text files packed in a folder named 20Newsgroups. This folder has two subfolders: one contains the training data and the other contains the test data. Each subfolder contains 20 folders…
This article focuses on a data storytelling project, in other words, exploratory data analysis. When you first look at a dataset, big or small, it is hard to make sense of it right away. It takes effort and analysis to extract meaningful information from it.
In this article, we will take a dataset and use some popular Python libraries such as NumPy, pandas, Matplotlib, and Seaborn to extract meaningful information from it. At the end, we will run a prediction model from the scikit-learn library.
As a data scientist or a data analyst, you may…
ANOVA (Analysis of Variance) is a process for comparing the means of more than two groups. It can also be used to compare the means of two groups, but that is unnecessary: the means of just two groups can be compared with a hypothesis testing method such as a t-test.
If you need a refresher on the t-test or z-test please check this article:
This article will focus on comparing the means of more than two groups using the Analysis of Variance (ANOVA) method. This method breaks down the overall variability of a given continuous outcome into pieces.
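To make "breaking the variability into pieces" concrete, here is a minimal sketch of the standard one-way ANOVA decomposition: the total sum of squares splits into a between-group part and a within-group part, and the F statistic is the ratio of their mean squares. The three groups below are made-up numbers, and the function name is an assumption for illustration only.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (ss_between, ss_within, ss_total, f_stat) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    # Overall variability of the observations around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, ss_total, f_stat

# Hypothetical measurements for three groups.
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [8.0, 9.0, 10.0]]
ss_between, ss_within, ss_total, f_stat = one_way_anova(groups)
print(ss_between, ss_within, ss_total, f_stat)
```

For these numbers the pieces add up as 24 + 6 = 30, and the F statistic is 12. A large F means the group means differ by more than the spread within the groups would suggest. Turning F into a p-value requires the F distribution, which this sketch leaves out.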