
Performing the Feature Selection Methods with a Real Dataset and Retrieve the Selected Features After Each Method

Feature selection is one of the most important parts of machine learning. Real-world datasets often contain many features, but not all of them are useful for a given machine learning algorithm. Using too many unnecessary features causes several problems. The first is computation cost: an unnecessarily large dataset takes an unnecessarily long time to run through the algorithm. It can also lead to overfitting, which is never desirable.

There are several feature selection methods out there. I will demonstrate four popular feature…
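As a minimal sketch of one common approach, univariate selection with scikit-learn both transforms the data and lets you retrieve which features were selected, as the title describes. The dataset, scoring function, and `k` here are my own illustrative choices, not necessarily the ones used in the article:

```python
# Sketch: univariate feature selection with scikit-learn's SelectKBest.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)

# Retrieve the indices of the selected features after fitting
selected = selector.get_support(indices=True)
print(X_new.shape)   # (150, 2)
print(selected)
```

`get_support(indices=True)` is the piece that answers "which features survived?" — the same question each method in the article needs to answer.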


Using a Single Perceptron

Stochastic gradient descent is a widely used approach in machine learning and deep learning. This article explains stochastic gradient descent using a single perceptron, using the famous iris dataset. I am assuming that you already know the basics of gradient descent. If you need a refresher, please check out this linear regression tutorial which explains gradient descent with a simple machine learning problem.

What Is Stochastic Gradient Descent?

Before diving into stochastic gradient descent, let’s review regular gradient descent. Gradient descent is an iterative algorithm. Let’s work through a simple example. As I mentioned, I will use a single perceptron:
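The core idea can be sketched in a few lines: instead of computing the gradient over the whole dataset, update the weights after each individual sample. The toy 2-D data below is my own illustration, not the iris dataset the article works with:

```python
# Minimal sketch of stochastic gradient descent with a single perceptron
# (sigmoid activation, log-loss). Toy, linearly separable 2-D data.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])  # binary labels

w = np.zeros(2)
b = 0.0
lr = 0.1  # learning rate

for epoch in range(20):
    for i in rng.permutation(len(X)):      # "stochastic": one sample at a time
        z = X[i] @ w + b
        pred = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
        grad = pred - y[i]                 # gradient of log-loss w.r.t. z
        w -= lr * grad * X[i]              # update using this one sample's gradient
        b -= lr * grad

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)
```

The only difference from batch gradient descent is the inner loop: each weight update sees a single (shuffled) sample rather than the full dataset.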


Pandas count, value_counts, and crosstab functions in detail

Pandas is a very popular Python library for data analysis, with a large number of functions. This article will discuss three very useful and widely used functions for summarizing data. I explain them with examples so we can use them to their full potential.

The three functions I am talking about today are count, value_counts, and crosstab.

The count function is the simplest. The value_counts function can do a bit more, and the crosstab function handles even more complicated work with simple commands.

The famous Titanic dataset is used for this demonstration. …
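To give a flavor of the three functions, here is a minimal sketch on a toy DataFrame standing in for the Titanic data (the column names and values are illustrative, not the real dataset):

```python
# count, value_counts, and crosstab on a tiny Titanic-like DataFrame.
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, None, 35.0, 27.0],
})

print(df["Age"].count())                       # count skips missing values -> 4
print(df["Sex"].value_counts())                # frequency of each category
print(pd.crosstab(df["Sex"], df["Survived"]))  # two-way frequency table
```

Note that `count` ignores `NaN` entries, `value_counts` tallies each distinct value, and `crosstab` cross-tabulates two columns at once.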


Extracting Meaning of a Dataset

Exploratory data analysis is essential for understanding any dataset. It includes data summarization, visualization, some statistical analysis, and predictive analysis. This article will focus on data storytelling, or exploratory data analysis, using R and several of its packages.

This article will cover:

1. The Summarization and Visualization of Some Key Points

2. Some Basic Statistics

3. Predictive Model

If you are a regular follower of my articles, you might have seen another exploratory data analysis project using the same dataset before in Python. Here is the link:

I am using the same dataset here for performing an exploratory data analysis in…


Straightforward and Easy Reasoning

This is a very common question, especially for beginners: where to start? Even for intermediate-level data scientists it can be a question, because different people have different preferences and different styles of work. Some companies prefer Python and some prefer R. I have friends who learned Python first, and then a recruiter or employer told them they should learn R, so now they are learning R. So which one is actually better?

I started with Python. When I started my MS at Boston University, I had to learn R, because some of the data analytics courses use R…


Preprocessing, Analysis, Visualization, and Sentiment Analysis of Text Data

Text data analysis is becoming easier every day. Prominent programming languages like Python and R have great libraries for it. There was a time when people thought you needed to be an expert coder to do these kinds of complex tasks. But with more developed and improved libraries, text data analysis can now be performed with simple, beginner-level coding knowledge.

In this article, I will work on a dataset that is primarily a text dataset. It contains customer reviews of Amazon baby products and…
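As a tiny, self-contained taste of the kind of preprocessing and sentiment scoring the article covers, here is a lexicon-based sketch. The word lists and example reviews are entirely illustrative, not drawn from the article's Amazon dataset:

```python
# Minimal lexicon-based sentiment sketch: preprocess, tokenize, score.
import re

POSITIVE = {"love", "great", "perfect", "easy", "happy"}
NEGATIVE = {"broke", "bad", "poor", "hard", "disappointed"}

def sentiment_score(review: str) -> int:
    """Positive-word count minus negative-word count after simple preprocessing."""
    words = re.findall(r"[a-z']+", review.lower())  # lowercase, strip punctuation
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Great product, my baby is happy"))    # -> 2
print(sentiment_score("Disappointed, it broke in a week"))   # -> -2
```

Real analyses use much larger lexicons or trained models, but the pipeline shape — clean, tokenize, score — is the same.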


Learn with examples and projects

It is very important for data scientists to learn statistics well. Learning visualization and data manipulation tools is great! But without a knowledge of statistics, it is not possible to infer real information from data.

I wrote several tutorials on different inferential statistics topics, and I realized that, combined, they make a nice course for learners. Each of the articles except the first one works through a project with a dataset, and learning by doing a project is a great way to learn.

At the same time, you will find some…


Used Some Great Packages and K Nearest Neighbors Classifier

With the increasing number of text documents, text document classification has become an important task in data science. At the same time, machine learning and data mining techniques are also improving every day. Both Python and R programming languages have amazing functionalities for text data cleaning and classification.

This article will focus on processing and classifying text documents using R libraries.

Problem Statement

The data used here is a set of text files packed in a folder named 20Newsgroups. This folder has two subfolders: one contains the training data and the other contains the test data. Each subfolder contains 20 folders…
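The article works in R, but the overall recipe — vectorize the documents, then fit a k-nearest-neighbors classifier — looks much the same in any language. Here is a rough Python analogue on a few toy documents (the texts and labels are my own illustration, not the 20Newsgroups files):

```python
# Sketch: TF-IDF vectorization + k-nearest-neighbors text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = [
    "the spacecraft entered orbit around the moon",
    "nasa launched a new rocket into space",
    "the team scored a goal in the final match",
    "the player won the hockey game last night",
]
train_labels = ["space", "space", "sport", "sport"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # documents -> TF-IDF vectors

clf = KNeighborsClassifier(n_neighbors=1)       # classify by the nearest document
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["the rocket reached orbit"])
print(clf.predict(X_test))
```

With real newsgroup data you would add cleaning steps (stop-word removal, stemming) before vectorizing, which is where much of the article's effort goes.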


Using the Pandas, Matplotlib, Seaborn, and Scikit-learn Libraries in Python

This article focuses on a data storytelling project — in other words, exploratory data analysis. After looking at a dataset, big or small, it is hard to make sense of it right away. It takes effort, work, and analysis to extract meaningful information from it.

In this article, we will take a dataset and use some popular Python libraries like NumPy, Pandas, Matplotlib, and Seaborn to find meaningful information in it. At the end, we will run a prediction model from the Scikit-learn library.

As a data scientist or a data analyst, you may…
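The workflow described above — summarize with Pandas first, then fit a Scikit-learn model — can be sketched compactly. The toy data and the choice of logistic regression here are illustrative assumptions, not the article's actual dataset or model:

```python
# Sketch: pandas summary followed by a scikit-learn prediction model.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "passed":        [0, 0, 0, 1, 1, 1, 1, 1],
})

# Summarize: group means hint at the relationship before any modeling
print(df.groupby("passed")["hours_studied"].mean())

model = LogisticRegression()
model.fit(df[["hours_studied"]], df["passed"])
print(model.predict([[2], [7]]))  # low vs. high study time
```

The exploratory step (the `groupby`) is what tells you a simple classifier is even worth fitting — which is the article's point about extracting meaning before predicting.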


Data Science, R

Differences in Means by Analyzing the Variance

ANOVA (Analysis of Variance) is a method for comparing the means of more than two groups. It can also be used to compare the means of two groups, but that is unnecessary: comparing the means of just two groups can be done with a hypothesis test such as a t-test.

If you need a refresher on the t-test or z-test please check this article:

This article will focus on comparing the means of more than two groups using the Analysis of Variance (ANOVA) method. This method breaks down the overall variability of a given continuous outcome into pieces.

One-Way Analysis of Variance

One way…
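For a quick sense of what a one-way ANOVA computes, here is a SciPy sketch on three illustrative groups (the numbers are made up for demonstration):

```python
# One-way ANOVA: does at least one group mean differ from the others?
from scipy.stats import f_oneway

group_a = [23, 25, 27, 22, 24]
group_b = [30, 31, 29, 32, 30]  # visibly shifted group
group_c = [24, 26, 25, 23, 27]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
print(p_value < 0.05)  # small p-value: at least one mean differs
```

The F-statistic is exactly the "overall variability broken into pieces" idea: variance between the group means divided by variance within the groups.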

Rashida Nasrin Sucky

Data Scientist and MS Student at Boston University. Read my blog:
