NLP360: Awesome NLP Resources

NLP360 is a curated list of resources related to Natural Language Processing (NLP): datasets and Python packages. It is updated frequently.


NLP Datasets

  • Complete NLP Dataset by The Eye - ArXiv (37GB), PubMed (6GB), StackExchange (34GB), OpenWebText (27GB) and GitHub (106GB)

  • The Big Bad NLP Database - A large, searchable database of NLP datasets; the CommonCrawl datasets have been added to it

  • CommonCrawl by Facebook - Facebook's release of a CommonCrawl dataset: 2.5TB of clean, unsupervised text in 100 languages

  • Wikipedia Data - CSV file containing the Wikidata id, title, lat/lng coordinates, and short description for all Wikipedia articles with location data (updated)

  • Datasets by Transformers - Datasets and evaluation metrics for natural language processing, from the team behind Transformers. Compatible with NumPy, Pandas, PyTorch and TensorFlow (a loading sketch follows this list)


  • Multidomain Sentiment Analysis Dataset - This is a slightly older dataset that features a variety of product reviews taken from Amazon.

  • IMDB Reviews - Featuring 25,000 movie reviews, this relatively small dataset was compiled primarily for binary sentiment classification use cases.

  • Stanford Sentiment Treebank - Also built from movie reviews, Stanford’s dataset was designed to train a model to identify sentiment in longer phrases. It contains over 10,000 snippets taken from Rotten Tomatoes.

  • Sentiment140 - This popular dataset contains 1.6 million tweets formatted with 6 fields: polarity, ID, tweet date, query, user, and the text. Emoticons have been pre-removed.

  • Twitter US Airline Sentiment - Scraped in February 2015, these tweets about US airlines are classified as positive, negative, or neutral. Negative tweets have also been categorized by reason for complaint.


  • 20 Newsgroups - This collection of approximately 20,000 documents covers 20 different newsgroups, from baseball to religion (a scikit-learn loading sketch follows this list).

  • ArXiv - This repository contains the full text of the entire arXiv research paper archive, with a total dataset size of 270 GB.

  • Reuters News Dataset - The documents in this dataset appeared on Reuters in 1987. They have since been assembled and indexed for use in machine learning.

  • The WikiQA Corpus - This corpus is a publicly-available collection of question and answer pairs. It was originally assembled for use in research on open-domain question answering.

  • UCI’s Spambase - Originally created by a team at Hewlett-Packard, this large spam email dataset is useful for developing personalized spam filters.

  • Yelp Reviews - This open dataset released by Yelp contains more than 5 million reviews.

  • WordNet - Compiled by researchers at Princeton University, WordNet is essentially a large lexical database of English ‘synsets’, or groups of synonyms that each describe a distinct concept (an NLTK access sketch follows this list).

  • The Blog Authorship Corpus - This dataset includes over 681,000 posts written by 19,320 different bloggers. In total, there are over 140 million words within the corpus.

  • Enron Dataset - Over half a million anonymized emails from over 100 users. It’s one of the few publicly available collections of “real” emails for study and for building training sets.

  • Project Gutenberg - Extensive collection of book texts. These are public domain and available in a variety of languages, spanning a long period of time.
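
Several of the datasets above can be pulled straight into Python. As a minimal sketch of the Datasets package entry above, here is how it loads the IMDB Reviews set also listed here; the "imdb" identifier and its text/label fields follow the public Hugging Face hub and are the only assumptions:

```python
# Minimal sketch: load the IMDB Reviews dataset with the `datasets` package
# (pip install datasets). Splits are downloaded and cached on first use.
from datasets import load_dataset

imdb = load_dataset("imdb")    # DatasetDict with "train" and "test" splits
sample = imdb["train"][0]
print(sample["text"][:200])    # first 200 characters of the first review
print(sample["label"])         # 0 = negative, 1 = positive
```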
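
The 20 Newsgroups collection above ships with scikit-learn's dataset loaders, so no manual download step is needed; a minimal sketch:

```python
# Minimal sketch: fetch the 20 Newsgroups corpus via scikit-learn
# (pip install scikit-learn). The corpus is downloaded and cached on first call.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
print(len(train.data))           # number of documents in the train split
print(train.target_names[:5])    # a few of the 20 newsgroup categories
```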
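
WordNet, also listed above, is most easily queried from Python through NLTK's corpus reader; a minimal sketch (the corpus itself is fetched once with nltk.download):

```python
# Minimal sketch: look up WordNet synsets through NLTK (pip install nltk).
import nltk
nltk.download("wordnet")                 # one-time download of the WordNet corpus
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:3]:    # a word can belong to several synsets
    print(synset.name(), "-", synset.definition())
```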


NLP Python Packages

  • Haystack - Open-source framework for building end-to-end question answering systems for large document collections.

  • AdaptNLP - Powerful NLP toolkit built on top of Flair and Transformers for running, training and deploying state-of-the-art deep learning models. Unified API for end-to-end NLP tasks: Token Tagging, Text Classification, Question Answering, Embeddings, Translation, Text Generation etc.

  • Sentence-Transformers - Python package to compute dense vector representations of sentences or paragraphs using SOTA pretrained Transformer models (an encoding sketch follows this list).

  • Tweet-Preprocessor - Python library to clean text/tweets in a single line of code (a cleaning sketch follows this list).

  • SimpleTransformers - Simple library to build NLP deep learning models in 3 lines of code. It packs the powerful features of Hugging Face’s Transformers into a minimal API for end-to-end NLP tasks.

  • TextAttack - Adversarial attacks, adversarial training, and data augmentation in NLP

  • Fast.ai - Super high-level abstractions and easy implementations for NLP data preprocessing, model construction, training, and evaluation.

  • TorchText - Convenient data processing utilities to load text and prepare it in batches before you feed it into your deep learning framework

  • OpenNMT - Convenient and powerful tool for machine translation and sequence learning tasks

  • ParlAI - Framework for task-oriented dialogue, chit-chat dialogue and visual question answering

  • DeepPavlov - Framework mainly for developing chatbots and virtual assistants, providing all the environment tools necessary for a production-ready, industry-grade conversational agent

  • TextBlob - Provides a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of the Natural Language Toolkit (NLTK) and Pattern, and plays nicely with both :+1: (a sentiment sketch follows this list)

  • spaCy - Industrial-strength NLP with Python and Cython :+1: (an NER sketch follows this list)

  • textacy - Higher level NLP built on spaCy

  • gensim - Python library for unsupervised semantic modelling from plain text :+1: (a Word2Vec sketch follows this list)

  • scattertext - Python library to produce d3 visualizations of how language differs between corpora

  • GluonNLP - A deep learning toolkit for NLP, built on MXNet/Gluon, for research prototyping and industrial deployment of state-of-the-art models on a wide range of NLP tasks.

  • AllenNLP - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.

  • PyTorch-NLP - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, and common NLP metrics such as BLEU

  • Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)

  • PyNLPl - General-purpose Python Natural Language Processing Library. Also contains specific modules for parsing common NLP formats, most notably FoLiA, but also ARPA language models, Moses phrase tables and GIZA++ alignments.

  • PySS3 - Python package that implements a novel white-box machine learning model for text classification, called SS3. Since SS3 can visually explain its rationale, the package also comes with easy-to-use interactive visualization tools (online demos).

  • jPTDP - A toolkit for joint part-of-speech (POS) tagging and dependency parsing. jPTDP provides pre-trained models for 40+ languages.

  • BigARTM - A fast library for topic modelling

  • Snips NLU - A production-ready library for intent parsing

  • Chazutsu - A library for downloading and parsing standard NLP research datasets

  • Word Forms - Accurately generates all possible forms of an English word

  • Multilingual Latent Dirichlet Allocation (LDA) - A multilingual and extensible document clustering pipeline

  • NLP Architect - A library for exploring the state-of-the-art deep learning topologies and techniques for NLP and NLU

  • Flair - A very simple framework for state-of-the-art multilingual NLP built on PyTorch. Includes BERT, ELMo and Flair embeddings (an NER sketch follows this list).

  • Kashgari - Simple, Keras-powered multilingual NLP framework that lets you build models in 5 minutes for named entity recognition (NER), part-of-speech (PoS) tagging and text classification tasks. Includes BERT and word2vec embeddings.

  • FARM - FARM makes cutting-edge transfer learning simple and helps you to leverage pretrained language models for your own NLP tasks.

  • Rita DSL - A DSL loosely based on RUTA on Apache UIMA. Lets you define language patterns (rule-based NLP) that are then translated into spaCy patterns or, if you prefer something lighter with fewer features, regex patterns.

  • Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch (a pipeline sketch follows this list).

  • Tokenizers - Tokenizers optimized for research and production.

  • fairseq - Facebook AI Research implementations of SOTA seq2seq models in PyTorch.

  • corex_topic - Hierarchical Topic Modeling with Minimal Domain Knowledge
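
To ground a few of the package entries above, here are short usage sketches. First, the Transformers pipeline API; which checkpoint the pipeline downloads by default is version-dependent, so treat the printed output as illustrative:

```python
# Minimal sketch: one-line sentiment analysis with the Transformers `pipeline`
# helper (pip install transformers). A default model is fetched on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("NLP360 is a handy list of NLP resources.")[0]
print(result["label"], result["score"])   # e.g. POSITIVE plus a confidence score
```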
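
Sentence-Transformers reduces sentence embedding to a single encode call; "all-MiniLM-L6-v2" is one commonly used pretrained model, and any other model name from the project works the same way:

```python
# Minimal sketch: dense sentence embeddings with Sentence-Transformers
# (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The cat sat on the mat.",
                           "A feline rested on a rug."])
print(embeddings.shape)   # (2, 384): one 384-dimensional vector per sentence
```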
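
spaCy's pretrained pipelines bundle tokenization, tagging and NER; the small English model en_core_web_sm is used here and must be downloaded separately:

```python
# Minimal sketch: named entity recognition with spaCy
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple/ORG, U.K./GPE, $1 billion/MONEY
```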
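
TextBlob wraps common tasks behind one object; sentiment comes back as polarity and subjectivity scores:

```python
# Minimal sketch: sentiment analysis with TextBlob (pip install textblob).
from textblob import TextBlob

blob = TextBlob("TextBlob makes common NLP tasks painless.")
print(blob.sentiment)   # Sentiment(polarity=..., subjectivity=...),
                        # polarity in [-1, 1], subjectivity in [0, 1]
```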
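
Tweet-Preprocessor really is a one-liner; note the package installs as tweet-preprocessor but imports as preprocessor:

```python
# Minimal sketch: cleaning a tweet with tweet-preprocessor
# (pip install tweet-preprocessor).
import preprocessor as p

tweet = "Loving #NLP360! Great list from @someone https://example.com :)"
print(p.clean(tweet))   # URLs, mentions, hashtags, emojis and smileys stripped
```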
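
gensim trains Word2Vec directly on tokenized sentences; the two toy sentences below are only for illustration, and real corpora need far more data (the vector_size keyword is called size in gensim versions before 4.0):

```python
# Minimal sketch: training a tiny Word2Vec model with gensim (pip install gensim).
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)
print(model.wv["cat"][:5])   # first 5 dimensions of the learned vector for "cat"
```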
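
Finally, Flair's quickstart-style NER; SequenceTagger.load("ner") fetches a pretrained English model on first use:

```python
# Minimal sketch: named entity recognition with Flair (pip install flair).
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)             # tags are written onto the Sentence object
print(sentence.to_tagged_string())   # entities annotated inline
```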