Learning statistics well is very important for data scientists. Learning visualization tools and data manipulation tools is great! But without a knowledge of statistics, it is not possible to infer real information from the data.
I have written several tutorials on different inferential statistics topics, and I realized that combined they make a nice course for learners. Each article except the first works through a project with a dataset, and learning by doing a project is a great way to learn.
At the same time, you will find some…
With the increasing number of text documents, text document classification has become an important task in data science. At the same time, machine learning and data mining techniques are also improving every day. Both Python and R programming languages have amazing functionalities for text data cleaning and classification.
This article will focus on text document processing and classification using R libraries.
The data used here is a set of text files packed in a folder named 20Newsgroups. This folder has two subfolders: one contains the training data and the other contains the test data. Each subfolder contains 20 folders…
This article focuses on a data storytelling project, in other words, exploratory data analysis. Whether a dataset is big or small, it is hard to make sense of it right away. It takes effort and analysis to extract meaningful information from that dataset.
In this article, we will take a dataset and use some popular Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn to find meaningful information in it. At the end, we will run a prediction model from the scikit-learn library.
As a data scientist or a data analyst, you may…
ANOVA (Analysis of Variance) is a process to compare the means of more than two groups. It can also be used to compare the means of two groups, but that is unnecessary: the means of just two groups can be compared using a hypothesis testing method such as a t-test.
If you need a refresher on the t-test or z-test, please check this article:
This article will focus on comparing the means of more than two groups using the Analysis of Variance (ANOVA) method. This method breaks down the overall variability of a given continuous outcome into pieces.
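To make the idea of breaking variability into pieces concrete, here is a minimal Python sketch of a one-way ANOVA. It splits the total sum of squares into a between-group part and a within-group part and forms the F statistic from them. The function name `one_way_anova` and the three small groups are made up for illustration and do not come from the article's dataset.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (ss_between, ss_within, f_stat) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    # Between-group piece: how far each group mean sits from the grand mean,
    # weighted by the group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    # Within-group piece: spread of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F statistic: ratio of the two mean squares.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat

# Hypothetical data: three groups with three observations each.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
ss_between, ss_within, f_stat = one_way_anova(groups)
print("Between-group SS:", ss_between)
print("Within-group SS:", ss_within)
print("F statistic:", f_stat)
```

The two pieces add up to the total sum of squares around the grand mean. A large F statistic suggests that the group means differ by more than the within-group noise would explain.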
One of the most basic, popular, and powerful statistical models is logistic regression. If you are familiar with linear regression, logistic regression builds on it: it uses the same linear formula with a slightly different implementation. This article will discuss the details of logistic regression in R. For a refresher or a better understanding, I will also discuss some of the formulas behind the model.
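As a rough illustration of how the linear formula is reused, here is a minimal sketch. It is written in Python rather than R, and the coefficients are hypothetical, not fitted to any data. It computes the usual linear combination and then passes it through the sigmoid function to obtain a probability between 0 and 1.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    # Same linear formula as linear regression ...
    z = intercept + slope * x
    # ... but the result is transformed into a probability.
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -3.0, 1.5

for x in (0, 2, 4):
    print(x, round(predict_probability(x, intercept, slope), 3))
```

A probability above a chosen threshold, commonly 0.5, is then read as the positive class.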
If you need a refresher on linear regression, please feel free to go through this article:
As I mentioned before, this uses the same linear formula as linear regression. Then what is the difference between linear…
Data visualization is essential if you deal with data in any way, and I focus on it a lot. I have written several articles on data visualization in Python, and I realized that compiling them on one page would make a huge collection of data plotting techniques in one place. The amount of data visualization you may learn from here might rival any paid visualization course out there.
This is arguably the most popular and most used visualization library in Python. There are other high-quality Python libraries that are built on Matplotlib. Even if you use some other libraries…
Thank you so much for checking out my blog! Readers can still read an article even if it is not in a publication. When it is in a publication, that publication's followers see it in their feed, so the article gets more visibility.
Using subplots to put multiple plots in one figure can be very useful for summarizing a lot of information in a small space. Subplots are helpful in making reports or presentations. This article will focus on how to use subplots efficiently and take fine control over the grids.
We will start with the basic subplot function to make equal-sized plots first. Let’s do the necessary imports:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Here is the basic subplots function in Matplotlib that makes two rows and three columns of equal-sized rectangular space:
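A minimal sketch of such a grid, assuming the standard `plt.subplots` call with two rows and three columns, might look like the following. The `figsize` value and the panel titles are arbitrary choices for illustration. The `Agg` backend line is only there so the sketch runs outside a notebook; it is not needed with `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for running as a script
import matplotlib.pyplot as plt

# Two rows and three columns of equal-sized axes.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

# axes is a 2D NumPy array indexed as axes[row, column].
for row in range(2):
    for col in range(3):
        axes[row, col].set_title(f"row {row}, col {col}")

fig.tight_layout()  # keep titles and tick labels from overlapping
print(axes.shape)
```

Each entry of `axes` is an ordinary Matplotlib axes object, so any plotting method can be called on it individually.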
The confidence interval, t-test, and z-test are very popular and widely used methods in inferential statistics. They are so important because, for any research or data analysis, we can only use a sample to come to a conclusion about a large population. In that case, these inferential statistical methods help us consider the errors and infer a better estimate for a larger population using a smaller sample.
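As a small illustration of estimating a population value from a sample, here is a sketch of a confidence interval for a mean. It uses only the Python standard library and a z critical value. The sample values and the function name `z_confidence_interval` are made up for illustration. For a sample this small, a t critical value would usually be more appropriate; the z version is used here only to keep the sketch dependency-free.

```python
import math
from statistics import NormalDist, mean, stdev

def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval for the mean."""
    n = len(sample)
    sample_mean = mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_critical * standard_error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample, not taken from any dataset in the article.
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
low, high = z_confidence_interval(sample)
print("Sample mean:", mean(sample))
print("95% interval:", (low, high))
```

Raising the confidence level widens the interval, which reflects the trade-off between certainty and precision.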
You may think that is a lot to cover in one article. Yes, it is actually a lot to digest in one day. …
Logistic regression is very popular in machine learning and statistics. It can work very well on both binary and multiclass classification. I have written tutorials on both binary and multiclass classification with logistic regression before. This article focuses on image classification with logistic regression.
If you are totally new to logistic regression, please go to this article first. This article has a detailed explanation of how a simple logistic regression algorithm works.
It will be helpful if you are familiar with logistic regression already. If not, I hope you will still understand the concepts here. …