The Python community offers a huge number of packages covering the entire data science pipeline, from data cleaning to building deep learning models to deployment. The most appreciated and commonly used packages are -
Pandas : Data manipulation and analysis
Matplotlib/Seaborn/Plotly : Data visualization
Scikit-learn : Building Machine Learning models
Keras/TensorFlow/PyTorch : Building deep learning models
Flask : Web app development/ML Applications
These packages have earned plenty of appreciation and love from the data science community.
But there are also some Python libraries for data science that are useful yet underrated.
These packages can save you from writing a lot of code.
They let you use state-of-the-art models in just a single line of code.
Let’s dive in.
1. MissingNo
MissingNo is a Python library for missing-value analysis with impressive visualizations such as a dense data matrix display, bar charts, heatmaps and dendrograms.
# pip install missingno
import missingno as msno

# missing value visualization: dense data display
msno.matrix(dataframe)

# missing value visualization: bar charts
msno.bar(dataframe)

# missing value visualization: heatmaps
msno.heatmap(dataframe)

# missing value visualization: dendrogram
msno.dendrogram(dataframe)
2. Mlxtend
As the name suggests, Mlxtend (machine learning extensions) extends the standard implementations of many machine learning algorithms, makes them easier to use and definitely saves a lot of time.
For example: association rule mining (with Apriori, FP-Growth & FP-Max support), EnsembleVoteClassifier, and StackingCVClassifier. Here is StackingCVClassifier in action:
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
import numpy as np
import warnings

warnings.simplefilter('ignore')

RANDOM_SEED = 42

# example data: X holds the features, y the labels
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()

# Starting from v0.16.0, StackingCVRegressor supports
# `random_state` to get deterministic result.
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
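For the association rule mining side, here is a minimal sketch using mlxtend.frequent_patterns; the tiny transaction list below is made up purely for illustration:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# toy transactions, purely illustrative
transactions = [['milk', 'bread', 'butter'],
                ['bread', 'butter'],
                ['milk', 'bread'],
                ['milk', 'butter']]

# one-hot encode the transactions into a boolean dataframe
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# frequent itemsets with at least 50% support
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# derive association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])

Swapping apriori for fpgrowth or fpmax only requires changing the import, since they share the same interface.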
3. Flair
Flair is a powerful NLP library which allows you to apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.
Yes, many libraries promise that. So what sets Flair apart?
It’s Stacked embeddings!
Stacked embeddings are one of the most interesting features of Flair and will make you want to use this library even more.
They provide means to combine different embeddings together. You can use both traditional word embeddings (like GloVe, word2vec, ELMo) together with Flair contextual string embeddings or BERT.
You can very easily mix and match Flair, ELMo, BERT and classic word embeddings. All you need to do is instantiate each embedding you wish to combine and use them in a StackedEmbedding.
For instance, let’s say we want to combine the multilingual Flair and BERT embeddings to train a hyper-powerful multilingual downstream task model. First, instantiate the embeddings you wish to combine:
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

from flair.embeddings import StackedEmbeddings

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])
Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.
from flair.data import Sentence

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding
# as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
4. AdaptNLP
AdaptNLP is another easy-to-use yet powerful NLP toolkit built on top of Flair and Transformers for running, training and deploying state-of-the-art deep learning models.
It has a unified API for end-to-end NLP tasks: token tagging, text classification, question answering, embeddings, translation, text generation etc.
from adaptnlp import EasyQuestionAnswering
from pprint import pprint

## Example Query and Context
query = "What is the meaning of life?"
context = "Machine Learning is the meaning of life."
top_n = 5

## Load the QA module and run inference on results
qa = EasyQuestionAnswering()
best_answer, best_n_answers = qa.predict_qa(
    query=query,
    context=context,
    n_best_size=top_n,
    mini_batch_size=1,
    model_name_or_path="distilbert-base-uncased-distilled-squad")

## Output top answer as well as top 5 answers
print(best_answer)
pprint(best_n_answers)
Text summarization with EasySummarizer follows the same pattern:

from adaptnlp import EasySummarizer

# Text from encyclopedia Britannica on Einstein
text = """Einstein would write that two “wonders” deeply affected his early years. The first was his encounter with a compass at age five.
He was mystified that invisible forces could deflect the needle. This would lead to a lifelong fascination with invisible forces.
The second wonder came at age 12 when he discovered a book of geometry, which he devoured, calling it his 'sacred little geometry
book'. Einstein became deeply religious at age 12, even composing several songs in praise of God and chanting religious songs on
the way to school. This began to change, however, after he read science books that contradicted his religious beliefs. This challenge
to established authority left a deep and lasting impression. At the Luitpold Gymnasium, Einstein often felt out of place and victimized
by a Prussian-style educational system that seemed to stifle originality and creativity. One teacher even told him that he would
never amount to anything."""

summarizer = EasySummarizer()

# Summarize
summaries = summarizer.summarize(
    text=text,
    model_name_or_path="t5-small",
    mini_batch_size=1,
    num_beams=4,
    min_length=0,
    max_length=100,
    early_stopping=True)

print("Summaries:\n")
for s in summaries:
    print(s, "\n")
5. SimpleTransformers
SimpleTransformers is awesome and my go-to library for NLP deep learning models. It packs all the powerful features of Hugging Face's Transformers into just three lines of code for end-to-end NLP tasks.
from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Train and Evaluation data needs to be in a Pandas Dataframe of two columns.
# The first column is the text with type str, and the second column is the label with type int.
train_data = [['Example sentence belonging to class 1', 1],
              ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)

eval_data = [['Example eval sentence belonging to class 1', 1],
             ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)

# Create a ClassificationModel
model = ClassificationModel('roberta', 'roberta-base')
# You can set class weights by using the optional weight argument

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
6. Sentence-Transformers
Sentence-Transformers computes dense vector embeddings for sentences and paragraphs using transformer models. These embeddings are useful for various downstream tasks like semantic search or clustering.
Sample code for computing embeddings:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
And sample code for semantic search over a small corpus:

from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.']
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.',
           'Someone in a gorilla costume is playing a set of drums.',
           'A cheetah chases prey on across a field.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # We use torch.topk to find the highest 5 scores
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: %.4f)" % (score))
7. Tweet-Preprocessor
Preprocessing social media data can be a bit frustrating at times because of irrelevant elements within the text like links, emojis, hashtags, usernames, mentions etc.
But not any more!
Tweet-Preprocessor is for you. This library cleans your text/tweets in a single line of code.
>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
# output: 'Preprocessor is'
It currently supports cleaning, tokenizing and parsing URLs, hashtags, mentions, reserved words (RT, FAV), emojis, smileys and numbers, and you have full control over which of these you want to clean from the text.
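As a small sketch of that fine-grained control (the option constants here follow the library's README; adjust to your installed version):

import preprocessor as p

# only strip URLs and emojis; hashtags and mentions are kept
p.set_options(p.OPT.URL, p.OPT.EMOJI)
print(p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor'))
# '#awesome' survives because p.OPT.HASHTAG was not selected above

# tokenize() replaces the selected entities with placeholder tokens instead of removing them
print(p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor'))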
8. Gradio
Gradio is another super cool library for quickly creating customizable UI components to demo your ML/DL models, either within your Jupyter notebook or in the browser.
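Here is a minimal sketch of the idea: wrap any Python function in a web UI. The greet function below is a made-up stand-in; in practice you would pass your model's prediction function instead.

import gradio as gr

# a stand-in for your model's prediction function
def greet(name):
    return "Hello " + name + "!"

# build a simple text-in / text-out interface and launch it
# (renders inside the notebook or opens in the browser)
demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch()

9. PPScore
PPScore (Predictive Power Score) is an asymmetric, data-type-agnostic alternative to correlation.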
It detects both linear and non-linear relationships between two columns.
It gives a normalized score ranging from 0 (no predictive power) to 1 (perfect predictive power).
It takes both numeric and categorical variables as input, so there is no need to convert your categorical variables into dummy variables before feeding them to PPScore.
import ppscore as pps

# Based on the dataframe we can calculate the PPS of x predicting y
pps.score(df, "x", "y")

# We can calculate the PPS of all the predictors in the dataframe against a target y
pps.predictors(df, "y")
10. Pytorch-Forecasting
Pytorch-Forecasting is a Python toolkit built on top of PyTorch Lightning which aims to make time series forecasting with neural networks easy.
The library provides abstractions for handling missing values and variable transformations, TensorBoard support, prediction and dependency plots, the Ranger optimizer for faster training, and Optuna for hyperparameter tuning.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# load data
data = ...

# define dataset
max_encoder_length = 36
max_prediction_length = 6
training_cutoff = "YYYY-MM-DD"  # day for cutoff

training = TimeSeriesDataSet(
    data[lambda x: x.date <= training_cutoff],
    time_idx= ...,
    target= ...,
    group_ids=[ ... ],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=[ ... ],
    static_reals=[ ... ],
    time_varying_known_categoricals=[ ... ],
    time_varying_known_reals=[ ... ],
    time_varying_unknown_categoricals=[ ... ],
    time_varying_unknown_reals=[ ... ],
)

# create validation dataset using the same normalization as the training dataset
validation = TimeSeriesDataSet.from_dataset(
    training, data, min_prediction_idx=training.index.time.max() + 1, stop_randomization=True)

# convert datasets to dataloaders
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=2)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=2)

# create PyTorch Lightning trainer with early stopping
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=1, verbose=False, mode="min")
lr_logger = LearningRateMonitor()
trainer = pl.Trainer(
    max_epochs=100,
    gpus=0,
    gradient_clip_val=0.1,
    limit_train_batches=30,
    callbacks=[lr_logger, early_stop_callback],
)

# define the network to train; most of the architecture is inferred from the dataset
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=32,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=16,
    output_size=7,
    loss=QuantileLoss(),
    log_interval=2,
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

# find optimal learning rate
res = trainer.lr_find(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
    early_stop_threshold=1000.0,
    max_lr=0.3,
)
print(f"suggested learning rate: {res.suggestion()}")
fig = res.plot(show=True, suggest=True)
fig.show()

# fit the network
trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)
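For the Optuna-based hyperparameter tuning mentioned above, the library ships a tuning helper for the Temporal Fusion Transformer. The sketch below follows its documentation; treat the exact argument names and search ranges as assumptions that may differ across versions.

from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

# run an Optuna study over the main TFT hyperparameters
# (reuses the train/val dataloaders created above)
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",          # where trial checkpoints are saved
    n_trials=100,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,    # let Optuna search the learning rate too
)

# show the best hyperparameters found
print(study.best_trial.params)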