Data Scientist @ ACI Worldwide | Edu Co-Lead @ Women in AI Ireland | Python and Data Science Instructor @ WAIA | ML Engineer @ Omdena | ❤ NLP

A simple guide to applying traditional machine learning and deep learning techniques using Python on Kaggle’s Fake News Dataset. It also briefly includes text and stylometric analysis of the articles.

The fake news dataset is one of the classic text analytics datasets available on Kaggle. It consists of the titles and text of genuine and fake articles from different authors. In this article, I walk through the entire text classification process using traditional machine learning approaches as well as deep learning.

Getting Started

I started by downloading the dataset from Kaggle in Google Colab.

Next, I read the data into a DataFrame and checked it for null values. Out of a total of 20,800 rows, there are 7 null values in text, 122 in title, and 503 in author. I decided to…


A simple configuration update could make downloading datasets from Kaggle and running your notebook easier on Google Colab. Follow this article for the script and configuration settings for kaggle.json.

Kaggle is one of the best practice fields for data scientists, and many of us like to use Google Colab to play around with datasets due to the availability of better data processing infrastructure. In this article, I walk through three simple steps to download any dataset seamlessly from Kaggle with a simple configuration that would

  • not require you to download/upload kaggle.json again and again
  • make re-running Jupyter notebooks smoother, even on another machine with access to your Google account and Drive.
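
As a rough illustration of what such a configuration could look like, here is a minimal sketch. It assumes that kaggle.json has already been saved once to a persistent folder (for example a mounted Google Drive folder) and that the Kaggle client picks up its config folder from the KAGGLE_CONFIG_DIR environment variable. The helper name use_kaggle_config and the Drive path in the comment are hypothetical and are not taken from the article's script.

```python
import json
import os
import stat
from pathlib import Path


def use_kaggle_config(config_dir):
    """Point the Kaggle client at a folder that already holds kaggle.json.

    Hypothetical helper for illustration; not the article's own script.
    """
    config_dir = Path(config_dir)
    token_path = config_dir / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    # Fail early if the token lacks the two fields an API token contains.
    token = json.loads(token_path.read_text())
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Try to restrict the file to its owner; some mounted file systems
    # do not support changing permissions, so failures are ignored.
    try:
        token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    except OSError:
        pass

    # Assumption: the Kaggle client reads its config folder from this variable.
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token_path


# Example call in Colab after mounting Drive (the path is an assumption):
# use_kaggle_config("/content/drive/MyDrive/kaggle")
```

Because the token stays in one persistent folder, a fresh runtime only needs this one call before any dataset download, rather than a new upload of kaggle.json.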

Step 1: Download your Kaggle API Token

Log in to Kaggle and access your account. Scroll down to the API section:


Other ML Jargons

A brief introduction to label leakage in machine learning and how it affects model performance

Have you ever been thrilled by perfect or near-perfect model performance, only to have that joy crushed by the one feature that sold you out?

In short, label leakage (also called target leakage, or simply leakage) happens when the information you want to predict is directly or indirectly present in your training dataset. It makes the estimate of the model’s generalization error far too optimistic: measured performance is dramatically inflated, but the model is useless for any real-world application.
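
To make the effect concrete, here is a small self-contained sketch on synthetic data. Everything in it is an assumption made for illustration: the two features, the noise levels, and the simple one-feature threshold classifier are not from the article. One feature is only weakly related to the label, while the other is almost a copy of it.

```python
import random


def make_rows(n, rng):
    """Synthetic rows of (honest_feature, leaky_feature, label)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        honest = rng.gauss(0.5 * label, 1.0)  # weakly related to the label
        leaky = label + rng.gauss(0.0, 0.01)  # nearly a copy of the label
        rows.append((honest, leaky, label))
    return rows


def fit_threshold(rows, col):
    """Use the midpoint between the two class means as a decision threshold."""
    means = []
    for cls in (0, 1):
        values = [r[col] for r in rows if r[2] == cls]
        means.append(sum(values) / len(values))
    return (means[0] + means[1]) / 2


def accuracy(rows, col, threshold):
    """Fraction of rows where 'feature > threshold' matches the label."""
    hits = sum((r[col] > threshold) == bool(r[2]) for r in rows)
    return hits / len(rows)


rng = random.Random(42)
train, test = make_rows(2000, rng), make_rows(1000, rng)

honest_acc = accuracy(test, 0, fit_threshold(train, 0))
leaky_acc = accuracy(test, 1, fit_threshold(train, 1))
print(f"honest feature accuracy: {honest_acc:.2f}")
print(f"leaky feature accuracy:  {leaky_acc:.2f}")
```

The leaky feature scores almost perfectly on held-out data, but only because the held-out rows contain the same leak. At prediction time in the real world that feature would not exist yet, so the score says nothing about how the model would actually perform.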

How Data Leakage occurs

The simplest example would be training a model with the labels themselves. In practice, an indirect representation of the target variable is inadvertently introduced while data…


This article demonstrates how to predict hazardous seismic bumps with different supervised classifiers and tune model hyperparameters, and it discusses the accuracy paradox and the importance of understanding the “business problem” for performance assessment.

Introduction

In my previous article on the seismic bumps dataset sourced from UCI’s Data Archive, I applied basic data analysis techniques for feature engineering and train-test splitting strategies for imbalanced datasets. In this article, I demonstrate how to apply supervised machine learning algorithms (KNN, Random Forest, and SVM) for predictions and how to tune hyperparameters (using GridSearchCV) for optimum results. The outcomes of the performance assessment also clearly exhibit the accuracy paradox, which I will elaborate on in the following sections. Additionally, the article demonstrates why understanding the “business problem” is essential to picking the best model.
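
The accuracy paradox mentioned above can be sketched with a toy example. The numbers below are made up for illustration and are not results from the seismic bumps dataset: 5 hazardous cases out of 100, one model that always predicts "not hazardous", and one more cautious model that raises some false alarms.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (hazards) that the model catches."""
    positives = sum(y_true)
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return caught / positives if positives else 0.0


# Illustrative labels: 5 hazardous (1) and 95 non-hazardous (0) cases.
y_true = [1] * 5 + [0] * 95

# Model A never predicts a hazard.
always_safe = [0] * 100

# Model B catches 4 of the 5 hazards but raises 15 false alarms.
cautious = [1, 1, 1, 1, 0] + [1] * 15 + [0] * 80

for name, y_pred in [("always_safe", always_safe), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), recall(y_true, y_pred))
```

Model A has the higher accuracy yet never flags a hazard, while Model B has lower accuracy and catches most of them. Which one is "better" depends on what a missed hazard costs compared with a false alarm, which is where the business problem comes in.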

The complete notebook can be found…


WaiAI4Good

This article explores how immersive technology is being leveraged to replace zoos and give animals a chance to return to their natural habitats, among their own kind.

A zoo reminds us of a day off to see animals, a day full of amusement, picnics, and long-lasting memories. I have fond memories of visiting zoos in different cities, where I saw animals imported from their natural habitats only to survive in captivity. Yes, your day of excitement and family bonding is another day for them to be ALLOWED to roam around within four walls a couple of meters apart.

Honestly, I am also a victim of the guilty pleasure of learning more about animals and wanting to go to Zoos; I binge-watched TV Shows on the National Geographic Channel and Animal…


This article demonstrates exploratory data analysis (EDA), feature engineering, and splitting strategies for unbalanced data using the seismic bumps dataset from the UCI Data Archive.

Introduction:

The seismic bumps dataset is one of the lesser-known binary classification datasets. It captures geological conditions in longwall coal mines using seismic and seismo-acoustic systems, to assess whether the mines are prone to rockbursts that cause seismic hazards.

Link to the dataset: https://archive.ics.uci.edu/ml/datasets/seismic-bumps

This is a good dataset for gaining practical exposure to unbalanced datasets, working with different kinds of data splits, and assessing a classifier’s performance metrics, including the accuracy paradox.
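
As one example of a split that suits unbalanced data, here is a minimal sketch of a stratified train/test split that keeps the class ratio the same on both sides. The function name, the 20% test fraction, and the toy labels are assumptions for illustration, not the article's code or data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test row per class; a class with a single row
        # therefore ends up only in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 positives and 90 negatives.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```

A plain random split could easily leave the test set with very few positives or none at all, which would make metrics such as recall unreliable. Splitting per class avoids that.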

The other thing about this dataset is that it has both categorical and numerical features, which provides a playground to wrangle and try out different feature…


A brief philosophy of knowledge representation for NLP and a concise guide to matrix design & the curse of dimensionality of distributed word representations.

I remember that when I studied algebra for the first time, I had this weird urge to turn those letters into words. I had a few scribbles of random letters being used as numbers, and I just played around with them.

Well, as it turns out, when I studied information theory and then word vectors, I was like “I knew it!!!”

One of the fundamentals of computer science is representing your knowledge in the form of numbers and even better — with a combination of 0s and 1s. …


A simple explanation of supervised machine learning terminology for electrical and electronics engineers

If you landed in the electrical or electronics engineering domain (as I once did myself), you are probably dealing with resistors, capacitors and inductors all the time.

Let’s say you haven’t been around the laboratory before, but you went through your lab handouts anyway and were able to identify the different resistors, capacitors and inductors provided to you during the lab session.

Congratulations! Your brain just executed a supervised machine learning algorithm.

Why did I call it supervised?

Because you were already aware of what they are, what they look like, and maybe bits and pieces of how they work. …


A comparative study using Python’s Seaborn and Matplotlib libraries to create an infographic about popular online streaming services

Introduction:

Among the few services that did not go out of business during the pandemic are online streaming services like Netflix and Amazon Prime Video. Take a look below at the Google Trends results for interest over time from 1st January to 12th June. The boost in web searches for COVID-19 and Netflix is clearly evident around mid-March 2020, when COVID-19 cases were soaring worldwide and businesses were shutting down for the inevitable lockdowns. There is also an interesting weekly pattern in which the number of searches increases during the weekend.

Popularity by Google Search:

I suppose, by far, the most popular service…


In this blog I have documented some of the common git commands that I use on a regular basis or have used a couple of times for a specific purpose. This is by no means an exhaustive list, but it is fairly comprehensive and is meant for quick reference.

Initialising/Cloning a Git repository
