The fake news dataset is one of the classic text analytics datasets available on Kaggle. It consists of genuine and fake articles’ titles and text from different authors. In this article, I have walked through the entire text classification process using traditional machine learning approaches as well as deep learning.
I started with downloading the dataset from Kaggle on Google Colab.
Next, I read the data into a DataFrame and checked it for null values. There are 7 null values in text, 122 in title, and 503 in author out of a total of 20,800 rows. I decided to…
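The null check described above can be sketched roughly as follows. This is a minimal illustration, assuming pandas is available; the tiny frame is made-up data standing in for the Kaggle CSV, and only the column names (title, author, text) come from the description above.

```python
import pandas as pd

# Made-up stand-in for the real dataset (assumption: the actual file would
# be loaded with pd.read_csv and has title, author and text columns).
df = pd.DataFrame(
    {
        "title": ["Headline A", None, "Headline C", "Headline D"],
        "author": [None, None, "Author X", "Author Y"],
        "text": ["body one", "body two", None, "body four"],
    }
)

# Count missing values per column and report the total number of rows.
null_counts = df.isnull().sum()
total_rows = len(df)

print(null_counts)
print(f"total rows: {total_rows}")
```

Comparing each column's null count against the total row count is usually enough to decide whether to drop the affected rows or fill them in.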
Kaggle is one of the best practice fields for data scientists, and many of us like to use Google Colab to play around with datasets due to the availability of better data processing infrastructure. In this article, I have walked through three simple steps to download any dataset seamlessly from Kaggle with a simple configuration that would
Log in to Kaggle and access your account. Scroll down to the API section:
Have you ever been overwhelmed by perfect or near-perfect model performance? Was that joy crushed by the one feature that sold you out?
In short, label leakage (also called target leakage or simply leakage) happens when the information you want to predict is directly or indirectly present in your training dataset. It causes a model to underestimate its generalization error: the measured performance looks incredibly good, but the model is useless for any real-world application.
The simplest example would be training a model with the labels themselves. In practice, an indirect representation of the target variable is inadvertently introduced while data…
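To make the "training with the labels themselves" case concrete, here is a small sketch in plain Python. Everything in it is hypothetical: the toy data, the one-feature threshold classifier (`fit_stump`), and the feature names are illustrations only, not taken from the article.

```python
from typing import List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def predict(xs: List[float], threshold: float) -> List[int]:
    """Predict class 1 whenever the feature exceeds the threshold."""
    return [int(x > threshold) for x in xs]


def fit_stump(xs: List[float], ys: List[int]) -> float:
    """Pick the threshold with the highest training accuracy."""
    best_t, best_acc = xs[0], -1.0
    for t in sorted(set(xs)):
        acc = accuracy(predict(xs, t), ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Hypothetical toy data: one honest but noisy feature, and one "leaky"
# feature that is simply a copy of the label.
train_labels = [0, 1, 0, 1, 1, 0]
test_labels = [0, 0, 1, 1]
train_honest = [0.2, 0.8, 0.5, 0.4, 0.9, 0.1]
test_honest = [0.3, 0.6, 0.45, 0.7]
train_leaky = [float(y) for y in train_labels]
test_leaky = [float(y) for y in test_labels]

honest_acc = accuracy(
    predict(test_honest, fit_stump(train_honest, train_labels)), test_labels
)
leaky_acc = accuracy(
    predict(test_leaky, fit_stump(train_leaky, train_labels)), test_labels
)

print(f"honest feature, held-out accuracy: {honest_acc:.2f}")
print(f"leaky feature, held-out accuracy:  {leaky_acc:.2f}")
```

The leaky feature scores perfectly even on held-out data, which is exactly why leakage is hard to spot from metrics alone: the score only collapses once the feature is unavailable at prediction time.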
In my previous article on the seismic bumps dataset sourced from UCI’s Data Archive, I applied basic data analysis techniques for feature engineering and train-test splitting strategies for imbalanced datasets. In this article, I have demonstrated how to apply supervised machine learning algorithms (KNN, Random Forest, and SVM) for predictions and tune hyperparameters (using GridSearchCV) for optimum results. The outcomes of the performance assessment also clearly exhibit the accuracy paradox, which I will elaborate on in the following sections. They also demonstrate why understanding the “business problem” is essential to pick the best model.
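The accuracy paradox mentioned above can be illustrated with a little arithmetic. The sketch below uses invented numbers (1,000 samples with 50 positives), not results from the seismic bumps dataset, and the two "models" are hand-written prediction lists rather than trained classifiers.

```python
from typing import List, Tuple


def confusion(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def accuracy_and_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Accuracy over all samples and recall on the positive class."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true), tp / (tp + fn)


# Hypothetical imbalanced data: 50 hazardous cases out of 1,000.
y_true = [1] * 50 + [0] * 950

# Baseline that always predicts the majority class ("no hazard").
always_negative = [0] * 1000

# A made-up model that catches 40 of the 50 hazards but raises 100 false alarms.
alerting_model = [1] * 40 + [0] * 10 + [1] * 100 + [0] * 850

baseline_acc, baseline_recall = accuracy_and_recall(y_true, always_negative)
model_acc, model_recall = accuracy_and_recall(y_true, alerting_model)

print(f"always-negative: accuracy={baseline_acc:.2f}, recall={baseline_recall:.2f}")
print(f"alerting model:  accuracy={model_acc:.2f}, recall={model_recall:.2f}")
```

The baseline wins on accuracy while detecting no hazards at all, which is why the choice of metric has to follow from the business problem rather than from the headline score.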
A zoo reminds us of a day off to see animals, a day full of amusement, picnics, and long-lasting memories. I have fond memories of visiting zoos in different cities where I saw animals imported from their natural habitats only to survive in captivity. Yes, your day of excitement and family bonding is another day for them to be ALLOWED to roam around within four walls a couple of meters apart.
Honestly, I am also a victim of the guilty pleasure of learning more about animals and wanting to go to Zoos; I binge-watched TV Shows on the National Geographic Channel and Animal…
The seismic bumps dataset is one of the lesser-known binary classification datasets that capture geological conditions using seismic and seismo-acoustic systems in longwall coal mines to assess if they are prone to rockburst causing seismic hazards or not.
Link to the dataset: https://archive.ics.uci.edu/ml/datasets/seismic-bumps
This is a good dataset that gives practical exposure to working with unbalanced datasets, trying different kinds of data splits, and assessing a classifier’s performance metrics, including the accuracy paradox.
The other thing about this dataset is that it has both categorical and numerical features, which provides a playground to wrangle and try out different feature…
I remember when I studied algebra for the first time, I had this weird urge to turn those letters into words. I had a few scribbles of random letters being used as numbers and then just played around.
Well, as it turns out, when I studied information theory and then word vectors, I was like “I knew it!!!”
One of the fundamentals of computer science is representing your knowledge in the form of numbers and even better — with a combination of 0s and 1s. …
If you landed in the Electrical or Electronics engineering domain (like I once did myself), you are probably dealing with resistors, capacitors and inductors all the time.
Let’s say you haven’t been around the laboratory before but you anyway went through your lab hand-outs and were able to identify the different resistors, capacitors and inductors provided to you during the lab session.
Congratulations! Your brain just executed a supervised machine learning algorithm.
Because you were already aware of what they are, what they look like, and maybe bits and pieces of how they work. …
Among the few services that would not go out of business during the pandemic are online streaming services like Netflix and Amazon Prime Video. Take a look below at the Google Trends results of the interest over time from 1st January to 12th June. The boost in web searches for COVID-19 and Netflix is clearly evident around mid-March 2020, when COVID-19 cases were soaring worldwide and businesses were shutting down for the inevitable lockdowns. There is also an interesting weekly pattern where the number of searches increases during the weekend.
I suppose, by far, the most popular service…
In this blog, I have documented some of the common git commands I use on a regular basis or have used a couple of times for a specific purpose. This is by no means an exhaustive list, but it is fairly comprehensive and is meant for quick reference.