Data Scientist at Fiserv | Core Member at Women in AI Ireland | MSc in Computer Science — Intelligent Systems | Fascinated by Natural Language Processing

A simple configuration update could make downloading datasets from Kaggle and running your notebook easier on Google Colab. Follow this article for the script and configuration settings for kaggle.json.

Kaggle is one of the best practice grounds for data scientists, and many of us like to use Google Colab to play around with datasets due to the availability of better data-processing infrastructure. In this article, I walk through three simple steps to download any dataset seamlessly from Kaggle with a simple configuration that would

  • not require you to download/upload kaggle.json again and again
  • make re-running Jupyter notebooks smoother, even on another machine with access to your Google account and Drive.
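
The article's own script is not reproduced in this excerpt. As a rough sketch of the idea, assuming kaggle.json is stored once in a persistent folder (for example, a mounted Google Drive directory), the Kaggle client can be pointed at that folder through the `KAGGLE_CONFIG_DIR` environment variable instead of uploading the file each time. The function name `use_kaggle_config` and any folder path passed to it are hypothetical.

```python
import os
from pathlib import Path


def use_kaggle_config(config_dir):
    """Point the Kaggle client at a folder that already holds kaggle.json.

    `config_dir` is a hypothetical location chosen by the user, such as a
    folder on a mounted Google Drive; it is not a path from the article.
    """
    config_dir = Path(config_dir)
    token = config_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    try:
        # Restrict the token to the current user (best effort: some
        # mounted file systems ignore or reject permission changes).
        os.chmod(token, 0o600)
    except OSError:
        pass
    # KAGGLE_CONFIG_DIR tells the Kaggle client where to look for kaggle.json.
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token
```

In a notebook, this would be called once after the drive is mounted and before any Kaggle command runs, so re-running the notebook needs no manual upload.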

Step 1: Download your Kaggle API Token

Log in to Kaggle and access your account. Scroll down to the API section:


A brief introduction to label leakage in machine learning and how it affects model performance

Have you ever been thrilled by a perfect or near-perfect model performance? Was that joy crushed by the one feature that sold you out?

In short, label leakage (also called target leakage, or simply leakage) happens when the information you want to predict is directly or indirectly present in your training dataset. It causes a model to underestimate its generalization error: the measured performance is boosted dramatically, but the model is useless for any real-world application.
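
To make the effect concrete, here is a minimal sketch on synthetic data (all numbers and names are made up for illustration). One feature is only weakly related to the label, while the other is essentially the label plus a little noise, which is what a leaked feature looks like. The same simple classifier is scored on both.

```python
import random


def nearest_mean_accuracy(train, test):
    """Fit a one-feature nearest-class-mean rule on `train`, score on `test`.

    Each item is a (feature_value, label) pair with label 0 or 1.
    """
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    # Predict 1 when the value is closer to the class-1 mean.
    hits = sum((abs(x - mean1) < abs(x - mean0)) == (y == 1) for x, y in test)
    return hits / len(test)


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(2000)]

# Honest feature: weakly informative about the label.
honest = [(0.5 * y + rng.gauss(0, 1), y) for y in labels]
# Leaky feature: the label itself with a little noise added.
leaky = [(y + rng.gauss(0, 0.05), y) for y in labels]

honest_acc = nearest_mean_accuracy(honest[:1500], honest[1500:])
leaky_acc = nearest_mean_accuracy(leaky[:1500], leaky[1500:])
print(f"honest feature: {honest_acc:.2f}, leaky feature: {leaky_acc:.2f}")
```

The leaky score looks excellent only because the feature already contains the answer. If that feature is unavailable at prediction time, the score says nothing about real-world performance.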

How Data Leakage occurs

The simplest example would be training a model with the labels themselves. In practice, an indirect representation of the target variable is inadvertently introduced during data collection and preparation. Features that trigger the outcome and features that are direct consequences of the target variable are gathered during data mining and thus should be manually identified during exploratory data analysis. …


This article demonstrates predicting hazardous seismic bumps using different supervised classifiers, tuning model hyperparameters, the accuracy paradox, and the importance of understanding the “business problem” for performance assessment

Introduction

In my previous article on the seismic bumps dataset sourced from UCI’s Data Archive, I applied basic data analysis techniques for feature engineering and train-test splitting strategies for imbalanced datasets. In this article, I demonstrate how to apply supervised machine learning algorithms (KNN, Random Forest, and SVM) for predictions and tune hyperparameters (using GridSearchCV) for optimum results. The outcomes of the performance assessment also clearly exhibit the accuracy paradox, which I will elaborate on in the following sections. They also demonstrate why understanding the “business problem” is essential to picking the best model.
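
As a small illustration of the accuracy paradox, here is a sketch with made-up labels (they are not taken from the seismic bumps dataset). A model that never raises an alarm scores a higher accuracy than one that catches most of the hazardous cases, even though it is worthless for the actual problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


# Hypothetical imbalanced labels: 5 hazardous cases (1) out of 100.
y_true = [1] * 5 + [0] * 95

# Model A never predicts a hazard.
always_safe = [0] * 100
# Model B catches 4 of the 5 hazards at the cost of 15 false alarms.
cautious = [1] * 4 + [0] * 1 + [1] * 15 + [0] * 80

for name, y_pred in [("always_safe", always_safe), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), recall(y_true, y_pred))
```

Which model is "best" depends on the business problem: if missing a hazard is far costlier than a false alarm, recall matters more than accuracy.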

The complete notebook can be found in my GitHub…


This article explores how immersive technology is being leveraged to replace zoos and give animals a chance to return to their natural habitats, among their own.

A zoo reminds us of a day off to see animals, a day full of amusement, picnics, and long-lasting memories. I have fond memories of visiting zoos in different cities, where I saw animals imported from their natural habitats only to survive in captivity. Yes, your day of excitement and family bonding is another day for them to be ALLOWED to roam around within four walls of a couple of meters.

Honestly, I am also a victim of the guilty pleasure of learning more about animals and wanting to go to Zoos; I binge-watched TV Shows on the National Geographic Channel and Animal Planet. It is thrilling to see and learn how these amazing beings survive all by themselves, using their unique defence mechanisms and unique “life-styles”. It’s always been like a dichotomy between accessibility to satiate your curiosity and just letting them be in their natural habitats. …


This article demonstrates exploratory data analysis (EDA), feature engineering, and splitting strategies for unbalanced data using the seismic bumps dataset from the UCI Data Archive.

Introduction:

The seismic bumps dataset is one of the lesser-known binary classification datasets. It captures geological conditions using seismic and seismo-acoustic systems in longwall coal mines to assess whether they are prone to rockbursts that cause seismic hazards.

Link to the dataset: https://archive.ics.uci.edu/ml/datasets/seismic-bumps

This is a good dataset for practical exposure to working with unbalanced data, trying different kinds of data splits, and assessing a classifier’s performance metrics, including the accuracy paradox.
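
One splitting strategy that suits unbalanced data is a stratified split, where each class is divided separately so the rare class keeps the same share in both parts. The sketch below is a plain-Python stand-in written for illustration (the function name and the toy labels are assumptions). In scikit-learn, `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 8 positive cases out of 100.
toy_labels = [1] * 8 + [0] * 92
train_idx, test_idx = stratified_split(toy_labels)
print(sum(toy_labels[i] for i in train_idx), "positives in train")
print(sum(toy_labels[i] for i in test_idx), "positives in test")
```

With a purely random split, a small test set could easily end up with no positive cases at all, which would make recall impossible to measure.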

The other thing about this dataset is that it has both categorical and numerical features, which provides a playground to wrangle the data and try out different feature transformation methods. …


A brief philosophy of knowledge representation for NLP and a concise guide to matrix design & the curse of dimensionality of distributed word representations.

I remember when I studied algebra for the first time, I had this weird urge to turn these letters into words. I had a few scribbles of random letters being used as numbers and then just played around.

Well, as it turns out, when I studied information theory and then word vectors, I was like “I knew it!!!”

One of the fundamentals of computer science is representing your knowledge in the form of numbers and even better — with a combination of 0s and 1s. …
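
As a minimal sketch of words written as 0s and 1s, the snippet below builds one-hot vectors for a toy sentence (the sentence and the function name are made up for illustration). This is the simplest encoding, not the distributed representations the article goes on to discuss.

```python
def one_hot_vectors(tokens):
    """Map each distinct token to a vector with a single 1 at its own index."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = {}
    for word in vocabulary:
        vector = [0] * len(vocabulary)
        vector[index[word]] = 1
        vectors[word] = vector
    return vectors


vectors = one_hot_vectors("the cat sat on the mat".split())
print(vectors["cat"])  # [1, 0, 0, 0, 0]
print(vectors["the"])  # [0, 0, 0, 0, 1]
```

Each vector is as long as the vocabulary, so the number of dimensions grows with every new word. That growth is one way the curse of dimensionality shows up in word representations.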


A simple explanation of supervised machine learning terminologies for electrical and electronics engineers

If you landed in the Electrical or Electronics engineering domain (as I once did), you are probably dealing with resistors, capacitors and inductors all the time.

Let’s say you haven’t been in the laboratory before, but you went through your lab hand-outs anyway and were able to identify the different resistors, capacitors and inductors provided to you during the lab session.

Congratulations! Your brain just executed a supervised machine learning algorithm.

Why did I call it Supervised?

Because you were already aware of what they are, what they look like, and maybe bits and pieces of how they work. …


A comparative study using Python’s Seaborn and Matplotlib libraries to create an infographic about popular online streaming services

Introduction:

Among the few services that will not go out of business during the pandemic are online streaming services like Netflix and Amazon Prime Video. Take a look below at the Google Trends results of the interest over time from 1st January to 12th June. The boost in web searches for COVID-19 and Netflix is clearly evident around mid-March 2020, when COVID-19 cases were soaring worldwide and businesses were shutting down for the inevitable lockdowns. There is also an interesting weekly pattern where the number of searches increases during the weekend.

Popularity by Google Search:

I suppose, by far, the most popular service is Netflix. To back up my assumption, here is a bar chart of the daily average searches of Amazon Prime Video, Netflix, Disney+ and Hulu. …


In this blog, I have documented some of the common git commands I use on a regular basis or have used a couple of times for a specific purpose. This is definitely not an exhaustive list; it is meant for quick reference.

Initialising/Cloning a GIT repository


Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

This is the continuation of my previous blog where I explored the words that appeared in the News Articles in Indian Media related to Covid-19. …
