Some Data Science Projects Every Data Scientist Must Know

Open source data science projects to enhance your portfolio
Let’s divide the projects into categories:

1. Open Source Computer Vision

  • FaceX-Zoo
  • Bottleneck Transformer
  • StyleGAN2-ADA

2. Open Source Natural Language Processing

  • Trankit
  • EasyNMT

3. Open Source Machine Learning

  • SeaLion

1. Open Source Computer Vision

FaceX-Zoo

FaceX-Zoo is one of the most impressive projects of the month. With face recognition becoming increasingly relevant in computer vision, it is an open-source data science project you do not want to miss.

FaceX-Zoo is a PyTorch toolbox for face recognition. Its training module offers a range of supervisory heads and backbones aimed at state-of-the-art face recognition, and its standardized evaluation module lets you test models on most of the popular benchmarks just by editing a simple configuration file.

A simple yet fully functional face SDK is also provided for validating trained models and building initial applications. The toolbox is designed to be easy to upgrade and extend as face-related domains develop.

[GitHub - Medium-Posts/FaceX-Zoo: A PyTorch Toolbox for Face Recognition](https://github.com/Medium-Posts/FaceX-Zoo)

Bottleneck Transformer — Pytorch

Another standout computer vision project, Bottleneck Transformer is a strong candidate for your data science portfolio.

The paper describes it as a “conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation.”

Baseline models improve significantly when the spatial convolutions in the last three bottleneck blocks of a ResNet are replaced with global self-attention, with no other changes. Sounds promising, doesn’t it?

The Bottleneck Transformer has the potential to serve as a strong baseline for future research in self-attention models for vision.
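To make the idea of self-attention over a feature map more concrete, here is a minimal sketch in plain Python. It is not the repository's code: it uses a single head, treats the raw feature vectors as queries, keys, and values (no learned projections), and leaves out position encodings. The function name `global_self_attention` is hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(feature_map):
    """Single-head self-attention over every position of an H x W x C map.

    Assumption for this sketch: queries, keys, and values are the raw
    feature vectors, so there are no learned projection weights.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Flatten the grid so every position can attend to every other position.
    tokens = [feature_map[r][c] for r in range(height) for c in range(width)]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    attended = []
    for query in tokens:
        # Scaled dot product between this position and all positions.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Each output vector is a weighted average of all value vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, tokens))
                 for d in range(dim)]
        attended.append(mixed)

    # Restore the H x W grid layout.
    return [attended[r * width:(r + 1) * width] for r in range(height)]


fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
out = global_self_attention(fmap)
```

Unlike a 3x3 convolution, which only mixes neighbouring positions, every output position here is a weighted average of the whole map. That is the "global" part of the idea.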

[GitHub - Medium-Posts/bottleneck-transformer-pytorch: Implementation of Bottleneck Transformer in…](https://github.com/Medium-Posts/bottleneck-transformer-pytorch)

StyleGAN2-ADA — Official PyTorch implementation

When a generative adversarial network is trained on too little data, the discriminator tends to overfit, which causes training to diverge. This project addresses the problem with an adaptive discriminator augmentation (ADA) mechanism that stabilizes training in limited-data regimes.
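The "adaptive" part can be sketched as a small feedback loop: measure how confidently the discriminator scores real images, then raise or lower the augmentation probability. The code below is a simplified illustration in plain Python, not the official implementation. The function name, the target of 0.6, and the ramp length of 500,000 images are assumptions made for this sketch.

```python
def update_augment_probability(p, real_logits, target=0.6,
                               images_per_update=256, ramp_images=500_000):
    """Return the augmentation probability after one adjustment step.

    The overfitting signal r_t is the mean sign of the discriminator
    outputs on real images: values near 0 suggest no overfitting, values
    near 1 suggest the discriminator is too confident on the training set.
    All default values are assumptions for this sketch.
    """
    signs = [1.0 if logit > 0 else -1.0 if logit < 0 else 0.0
             for logit in real_logits]
    r_t = sum(signs) / len(signs)

    # Fixed step size, chosen so p could climb from 0 to 1 over `ramp_images`.
    step = images_per_update / ramp_images
    if r_t > target:
        p += step   # too much overfitting: augment more often
    elif r_t < target:
        p -= step   # too little overfitting: augment less often

    # Keep p a valid probability.
    return min(max(p, 0.0), 1.0)


# Hypothetical discriminator outputs on a batch of real images.
p = update_augment_probability(0.0, [2.0, 1.5, 0.3, 0.9])
```

Calling this function every few minibatches lets the augmentation strength follow the degree of overfitting instead of being fixed in advance.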

The project comes with several promises, including:

  • Full support for all primary training configurations.
  • Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
  • Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

With improved speed and efficiency compared to the TensorFlow version, StyleGAN2-ADA is a nice open-source project to add to your portfolio.

[GitHub - Medium-Posts/stylegan2-ada-pytorch: StyleGAN2-ADA - Official PyTorch implementation](https://github.com/Medium-Posts/stylegan2-ada-pytorch)

2. Open Source Natural Language Processing

Trankit

The fascinating world of NLP is not far behind when it comes to impressive open-sourced data science projects. Trankit is another popular project released last month.

Trankit is a lightweight, transformer-based Python toolkit for multilingual natural language processing. Its two main components are a trainable pipeline for fundamental NLP tasks and a set of downloadable pretrained pipelines.

Trankit also outperforms Stanza (StanfordNLP), the current state-of-the-art multilingual toolkit, on many tasks across 90 Universal Dependencies v2.5 treebanks covering 56 languages. It does so while remaining efficient in memory usage and speed, which makes it usable by a wider audience.

[GitHub - Medium-Posts/trankit: Trankit is a Light-Weight Transformer-based Python Toolkit for…](https://github.com/Medium-Posts/trankit)

EasyNMT — Easy to use, state-of-the-art Neural Machine Translation

With easy installation, simple usage, and automatic download of pre-trained machine translation models, EasyNMT will help your NLP portfolio stand out.

It supports translation between 150+ languages, automatic language detection for 170+ languages, and both sentence-level and document-level translation.

At present, the project provides several pre-trained translation models, which are listed in the repository.

[GitHub - Medium-Posts/EasyNMT: Easy to use, state-of-the-art Neural Machine Translation for 100+…](https://github.com/Medium-Posts/EasyNMT)

3. Open Source Machine Learning

SeaLion

SeaLion is a machine learning project created to teach core concepts in a simpler way, using concise algorithms that still perform their tasks efficiently.

SeaLion is designed to teach today’s aspiring ML engineers popular machine learning concepts in a way that builds both intuition and practical skill.

It is beginner-friendly when it comes to working with standard datasets such as Iris, breast cancer, Swiss roll, the moons dataset, and MNIST. The algorithms in SeaLion include:

  1. Deep Neural Networks
  2. Regression
  3. Dimensionality Reduction
  4. Unsupervised Clustering
  5. Naive Bayes
  6. Trees
  7. Ensemble Learning
  8. Nearest Neighbors
  9. Utils

[GitHub - Medium-Posts/SeaLion: The first machine learning framework that encourages learning ML…](https://github.com/Medium-Posts/SeaLion)