Important Data Science Libraries You Might Be Ignoring

Let me guess — you’re new to data science and you feel like some kind of hipster when you say the word “pandas”. But since you’re new to this world, you have no idea of all the great tools that are available. Fret no more, because I have compiled a list of my favorite Python libraries for data science (plus a few extra cool ones that I didn’t see mentioned in other blogs).

Python is a very powerful programming language that is playing an important role in data science. It offers numerous libraries and packages. If you are interested in pursuing a career in data science, the variety of packages may seem overwhelming.

In this article, we’re going to talk about libraries that are not as popular as NumPy, Pandas, or SciPy but can be handy in specific domains.

According to data science experts and enthusiasts, these are some of the best hidden libraries in Python for data science; short usage sketches for several of them follow the list.

Top 10 Python libraries for data science (not so famous!)

  • Mlxtend : Mlxtend is a library of Python tools for data science and machine learning tasks, built on top of scikit-learn, NumPy, Pandas, and matplotlib. The main features include: a collection of machine learning algorithms, including linear regression and logistic regression; functions for plotting decision boundaries; functions for generating evaluation metrics such as the confusion matrix; feature selection (e.g., sequential feature selection); and many more.
  • Yellowbrick : Yellowbrick is a project to extend the Scikit-Learn API with visual analysis and diagnostic tools to support machine learning workflows. Visualizers can be used in a Scikit-Learn pipeline or independently outside of Scikit-Learn.
  • PhraseMatcher : This is one of the lesser-known gems from spaCy, an NLP library in Python. PhraseMatcher lets you search texts against a phrase list, which can be quite useful when you want to check whether any important terms (like product names) appear in the text.
  • fuzzywuzzy: This is a great library for fuzzy string matching. It has several functions for this (fuzz.ratio, fuzz.partial_ratio, etc.); each takes two strings as input and returns a similarity score as a percentage, reflecting how similar the two strings are (according to the function used).
  • Gleam : Gleam is a fantastic tool for building interactive infographics that include panels, buttons, and pages. These interactive web visualisations are also completely web-integrated, meaning they can be embedded in anything from a website to an endpoint!
  • Researchpy : Researchpy is a library for analyzing statistical models in Python. It includes a bunch of different functions that help you do things like compare means, calculate t-tests, and describe variables.
  • Pydotplus : Pydotplus is a Python interface to Graphviz’s DOT language and a handy way to visualize decision trees within Python. You can use it to render decision tree diagrams exported from Scikit-learn models, or build graphs from scratch with its own API.
  • Pyngrok : Pyngrok is a Python wrapper for ngrok that establishes secure tunnels to localhost servers. Typically, when you build something locally on your computer, it isn’t accessible to other people over the internet; Pyngrok gives you access by creating a tunnel from an external location to your local server.
  • Missingno: Missingno helps you manage missing values by making better use of data visualisations. It uses matplotlib to create four different charts that help you understand which data is missing: a bar chart, a matrix, a heatmap, and a dendrogram.
  • PyFlux : PyFlux is a Python open source package designed specifically for dealing with time-series challenges. The library has a large collection of current time-series models, including ARIMA, GARCH, and VAR models, among others. PyFlux, in a nutshell, is a probabilistic approach to time-series modelling. It’s definitely worth a shot.
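Here are short usage sketches for several of the libraries above, in the order they appear. The sample data, column names, and file paths in these snippets are my own illustrative assumptions, not something prescribed by the libraries.

First, a minimal Mlxtend sketch: plotting the decision boundary of a scikit-learn classifier with plot_decision_regions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions

# Toy 2-D, two-class dataset so the decision boundary is easy to see
X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)
clf = LogisticRegression().fit(X, y)

plot_decision_regions(X, y, clf=clf)  # shades each class region and overlays the points
plt.title("Logistic regression decision regions")
plt.show()
```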
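A Yellowbrick visualizer wraps a scikit-learn estimator and follows the familiar fit/score API; here is a sketch using its ConfusionMatrix visualizer.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ConfusionMatrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

viz = ConfusionMatrix(LogisticRegression(max_iter=1000))
viz.fit(X_train, y_train)   # fits the wrapped estimator
viz.score(X_test, y_test)   # computes predictions and fills in the matrix
viz.show()                  # renders the plot
```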
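A spaCy PhraseMatcher sketch: the product names below are made-up examples, and matching on the LOWER attribute makes the search case-insensitive.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # a tokenizer-only pipeline is enough for phrase matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["iphone 13", "galaxy s22", "pixel 6"]  # hypothetical product names
matcher.add("PRODUCTS", [nlp.make_doc(t) for t in terms])

doc = nlp("She traded her Pixel 6 for an iPhone 13 last week.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "Pixel 6", "iPhone 13"
```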
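And a quick look at fuzzywuzzy’s scoring functions (the strings are arbitrary examples):

```python
from fuzzywuzzy import fuzz

print(fuzz.ratio("data science", "data sciences"))               # ~96: overall similarity
print(fuzz.partial_ratio("pandas", "the pandas library"))        # 100: best-matching substring
print(fuzz.token_sort_ratio("new york mets", "mets new york"))   # 100: ignores word order
```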
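For Researchpy, a sketch comparing two made-up groups with descriptive statistics and a t-test:

```python
import pandas as pd
import researchpy as rp

df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,                # hypothetical two-group data
    "score": [4, 5, 6, 5, 7, 8, 9, 7, 8, 10],
})

print(rp.summary_cont(df["score"]))                # n, mean, SD, SE, 95% CI
descriptives, results = rp.ttest(df.loc[df.group == "a", "score"],
                                 df.loc[df.group == "b", "score"])
print(results)                                     # t statistic, p-value, effect sizes
```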
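A Pydotplus sketch for rendering a scikit-learn decision tree (this assumes the Graphviz binaries are installed on your machine):

```python
import pydotplus
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

dot_data = export_graphviz(tree, out_file=None, filled=True, rounded=True)  # DOT text
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png("iris_tree.png")  # writes the diagram to disk
```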
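A Pyngrok sketch that exposes a local server on port 8000 (assuming a recent pyngrok version, an ngrok auth token already configured, and something actually listening on that port):

```python
from pyngrok import ngrok

tunnel = ngrok.connect(8000)         # opens an HTTP tunnel to localhost:8000
print(tunnel.public_url)             # share this URL to reach your local server

ngrok.disconnect(tunnel.public_url)  # close the tunnel when you're done
```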
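A Missingno sketch on a small made-up DataFrame with gaps:

```python
import numpy as np
import pandas as pd
import missingno as msno

df = pd.DataFrame({                  # hypothetical frame with missing values
    "a": [1, np.nan, 3, 4, np.nan],
    "b": [np.nan, 2, 3, np.nan, 5],
    "c": [1, 2, 3, 4, 5],
})

msno.bar(df)         # non-null counts per column
msno.matrix(df)      # per-row nullity pattern
msno.heatmap(df)     # pairwise nullity correlation
msno.dendrogram(df)  # hierarchical clustering of nullity
```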
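Finally, a rough PyFlux sketch on a made-up series (the library is a bit dated, so treat this as an approximation of its ARIMA interface rather than gospel):

```python
import numpy as np
import pandas as pd
import pyflux as pf

data = pd.DataFrame({"y": np.random.randn(200).cumsum()})  # hypothetical series

model = pf.ARIMA(data=data, ar=2, ma=2, target="y")  # ARIMA with AR and MA order 2
results = model.fit("MLE")                           # maximum likelihood fit
results.summary()                                    # coefficient table
model.plot_predict(h=10)                             # forecast 10 steps ahead
```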

Other cool mentions:

-bamboolib: Provides a simple pandas GUI for beginners

-dask: Parallelizes numpy and pandas operations so you can use all the memory and cores on your machine
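A minimal dask sketch (the file and column names are made up); operations are lazy until you call .compute():

```python
import dask.dataframe as dd

# Reads the (hypothetical) CSV in partitions instead of loading it all at once
ddf = dd.read_csv("big_file.csv")

# Builds a task graph, then executes it in parallel across your cores
result = ddf.groupby("category")["value"].mean().compute()
print(result)
```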

-datapythonista-utils: A collection of utilities to make working with dates and times easier in Python

-feather: Reads and writes feather files (that you can share between R and Python) really quickly
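For feather, the easiest route is pandas’ built-in helpers (pyarrow does the work under the hood); R can read the same file via its arrow/feather packages. The file name is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": list("abcde")})

df.to_feather("data.feather")               # fast columnar write
same_df = pd.read_feather("data.feather")   # fast read back (or from R)
```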

-fastparquet: Reads/writes parquet files really quickly (consider parquet instead of CSV for much faster I/O!)
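A quick parquet round-trip through pandas with fastparquet as the engine (file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": list("abcde")})

df.to_parquet("data.parquet", engine="fastparquet")             # compressed, columnar on disk
same_df = pd.read_parquet("data.parquet", engine="fastparquet")
```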

-fsspec: A filesystem specification that creates an abstraction layer over multiple filesystems like S3, FTP, HDFS, GCS, etc.
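An fsspec sketch: the same open() call works across filesystems, provided the matching backend package (e.g. s3fs for s3://) is installed; the bucket and path below are hypothetical.

```python
import fsspec
import pandas as pd

with fsspec.open("s3://my-bucket/data.csv", mode="r") as f:
    df = pd.read_csv(f)   # swap the URL for file://, gcs://, ftp://, hdfs://, ...
```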

-hdfscontents: A Jupyter notebook extension to interact with HDFS clusters from your notebook

-modin: Parallelizes pandas operations on all the cores on your machine
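With modin, only the import changes (it runs pandas operations on Ray or Dask behind the scenes); the file and column names here are made up:

```python
import modin.pandas as pd   # drop-in replacement for the pandas import

df = pd.read_csv("big_file.csv")       # the read itself is parallelized
print(df.groupby("category").size())   # same pandas API, executed across cores
```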

We will cover detailed implementations of some of the important libraries mentioned here in a future post.

For practicing data scientists, it is important to understand the different library implementations and to know what is available.

This article provides some perspective on good packages that are fairly easy to use and offer simple mechanisms for data manipulation. Image processing, machine learning, and scientific computing are just a few of the many areas you can tackle with Python. Hope you enjoyed this article at MLDots.


Abhishek Mishra
