Open-Source NLP Projects (With Tutorials)


Join Datacamp’s Natural Language Processing (NLP) Skill Track and become an expert NLP engineer: Enroll in the NLP in Python Skill Track Today!


If you are a student or a professional looking for open-source Natural Language Processing (NLP) projects, this article is made for you.

The NLP projects listed below are organized by experience level, and all of them can be implemented in Python.

Natural Language Processing (NLP) Data Science Projects with Github Repos

1. Text Summarizer – Video Tutorial, Github Code

Text Summarizer is a project that condenses long paragraphs of text into a short summary. The video tutorial builds an article summarizer in Python using the Keras library, and along the way introduces core NLP concepts such as word embeddings and encoder-decoder architectures in a way that is fairly easy to understand. For a quicker start, have a look at this tutorial on creating a basic text summarizer in Python using the gensim library.

Full code for Text Summarization in Python:

# import the summarize function from gensim
# (note: gensim.summarization was removed in gensim 4.0, so this requires gensim < 4.0)
from gensim.summarization.summarizer import summarize 

# Paragraph
paragraph = "Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). The development of NLP applications is challenging because computers traditionally require humans to 'speak' to them in a programming language that is precise, unambiguous and highly structured, or through a limited number of clearly enunciated voice commands. Human speech, however, is not always precise -- it is often ambiguous and the linguistic structure can depend on many complex variables, including slang, regional dialects and social context."

# Get a summary of the text based on a ratio (40% of the original content) 
summ_per = summarize(paragraph, ratio=0.4) 
print("Percent summary:") 
print(summ_per) 

# Get the summary of the text based on number of words (50 words) 
summ_words = summarize(paragraph, word_count = 50) 
print("\n")
print("Word count summary:") 
print(summ_words) 

2. Personal Voice Assistant – Tutorial, Github Code

Voice assistants have become familiar with the rise of devices from Apple (Siri), Google (Google Assistant), Amazon (Alexa), Microsoft (Cortana), and many more. They perform tasks or services in response to a user's spoken commands.

Full code for creating a voice assistant in Python (note that speech output relies on the macOS `say` command):

from gtts import gTTS
import speech_recognition as sr
import os
import re
import webbrowser
import smtplib
import requests
from weather import Weather  # requires the weather-api package (backed by the Yahoo Weather API)

def talkToMe(audio):
    "speaks audio passed as argument"

    print(audio)
    for line in audio.splitlines():
        os.system("say " + line)  # 'say' is macOS-only; see the gTTS alternative below

    #  use the system's inbuilt say command instead of mpg123
    #  text_to_speech = gTTS(text=audio, lang='en')
    #  text_to_speech.save('audio.mp3')
    #  os.system('mpg123 audio.mp3')


def myCommand():
    "listens for commands"

    r = sr.Recognizer()

    with sr.Microphone() as source:
        print('Ready...')
        r.pause_threshold = 1
        r.adjust_for_ambient_noise(source, duration=1)
        audio = r.listen(source)

    try:
        command = r.recognize_google(audio).lower()
        print('You said: ' + command + '\n')

    #loop back to continue to listen for commands if unrecognizable speech is received
    except sr.UnknownValueError:
        print('Your last command couldn\'t be heard')
        command = myCommand()

    return command


def assistant(command):
    "if statements for executing commands"

    if 'open reddit' in command:
        reg_ex = re.search('open reddit (.*)', command)
        url = 'https://www.reddit.com/'
        if reg_ex:
            subreddit = reg_ex.group(1)
            url = url + 'r/' + subreddit
        webbrowser.open(url)
        print('Done!')

    elif 'open website' in command:
        reg_ex = re.search('open website (.+)', command)
        if reg_ex:
            domain = reg_ex.group(1)
            url = 'https://www.' + domain
            webbrowser.open(url)
            print('Done!')
        else:
            pass

    elif 'what\'s up' in command:
        talkToMe('Just doing my thing')
    elif 'joke' in command:
        res = requests.get(
                'https://icanhazdadjoke.com/',
                headers={"Accept":"application/json"}
                )
        if res.status_code == requests.codes.ok:
            talkToMe(str(res.json()['joke']))
        else:
            talkToMe('Oops! I ran out of jokes')

    elif 'current weather in' in command:
        reg_ex = re.search('current weather in (.*)', command)
        if reg_ex:
            city = reg_ex.group(1)
            weather = Weather()
            location = weather.lookup_by_location(city)
            condition = location.condition()
            talkToMe('The current weather in %s is %s. The temperature is %.1f degrees.' % (city, condition.text(), (int(condition.temp()) - 32) / 1.8))

    elif 'weather forecast in' in command:
        reg_ex = re.search('weather forecast in (.*)', command)
        if reg_ex:
            city = reg_ex.group(1)
            weather = Weather()
            location = weather.lookup_by_location(city)
            forecasts = location.forecast()
            for i in range(0, 3):
                talkToMe('On %s it will be %s. The maximum temperature will be %.1f degrees. '
                         'The lowest temperature will be %.1f degrees.' % (forecasts[i].date(), forecasts[i].text(), (int(forecasts[i].high()) - 32) / 1.8, (int(forecasts[i].low()) - 32) / 1.8))


    elif 'email' in command:
        talkToMe('Who is the recipient?')
        recipient = myCommand()

        if 'John' in recipient:
            talkToMe('What should I say?')
            content = myCommand()

            #init gmail SMTP
            mail = smtplib.SMTP('smtp.gmail.com', 587)

            #identify to server
            mail.ehlo()

            #encrypt session
            mail.starttls()

            #login (replace with your own credentials; Gmail requires an app password)
            mail.login('username', 'password')

            #send message (replace the sender address with your own)
            mail.sendmail('[email protected]', '[email protected]', content)

            #end mail connection
            mail.close()

            talkToMe('Email sent.')

        else:
            talkToMe('I don\'t know what you mean!')


talkToMe('I am ready for your command')

#loop to continue executing multiple commands
while True:
    assistant(myCommand())

3. Automated Keyword Extraction – Tutorial, Github Code, Video Tutorial

Keywords are an integral part of any article. They play a crucial role in the page-ranking systems and categorization algorithms of search engines. The purpose of this project is to identify keywords in a paragraph of text. This short tutorial on automatic keyword extraction using the gensim library in Python is quick and easy to learn.

Full code for Keyword Extraction in Python:

# import the keywords function from gensim
# (note: gensim.summarization was removed in gensim 4.0, so this requires gensim < 4.0)
from gensim.summarization import keywords

# Paragraph
paragraph = "Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. NLP is a component of artificial intelligence (AI). The development of NLP applications is challenging because computers traditionally require humans to 'speak' to them in a programming language that is precise, unambiguous and highly structured, or through a limited number of clearly enunciated voice commands. Human speech, however, is not always precise -- it is often ambiguous and the linguistic structure can depend on many complex variables, including slang, regional dialects and social context."

# Get the keywords from the paragraph 
keywords_txt = keywords(paragraph) 
print(keywords_txt) 

4. Sentiment Analysis – Tutorial, Github Code, Video Tutorial

Sentiment analysis identifies and categorizes sentences as positive or negative, revealing the writer's attitude toward a subject. Analyzing Twitter tweets is one of the best ways to get started with sentiment analysis in Python.

This video by Siraj Raval on Twitter Sentiment Analysis is quick and easy to learn.
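As a bare-bones illustration of the idea (not the Twitter approach from the video), a sentence can be scored against a tiny hand-made lexicon; real systems use trained models or resources such as VADER:

```python
# Toy lexicon invented for illustration; not a real sentiment resource
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(sentence):
    """Classify a sentence by counting positive vs. negative lexicon hits."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great movie"))  # positive
print(sentiment("this is terrible"))         # negative
```

A lexicon lookup ignores negation and context ("not good" scores as positive), which is exactly the gap that the machine-learning approaches in the tutorial address.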

5. Topics Identification – Tutorial, Github code, Video Tutorial

Topic identification is a multi-label classification task that identifies the topics discussed in a text. It can be used to classify magazine and newspaper articles from their headlines or titles: an automated algorithm reads through the documents and outputs the topics they discuss. It is also widely used to segment emails in an inbox.

Watch this extensive video tutorial explaining all the concepts required for topic identification in Python.

6. Multilabel text classification – Tutorial, Github Code, Video Tutorial

The purpose of this project is to create a multi-label text classification system that automatically assigns tags to questions posted on a forum such as Stack Overflow or Quora. It can also be used to classify newspaper articles and other documents.

Watch this tutorial series on Text Classification in Keras to make a news classifier in Python.
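Before diving into Keras, the tagging idea can be sketched with scikit-learn: binarize the tag sets, vectorize the questions with TF-IDF, and train one classifier per tag (the toy questions and tags below are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Toy forum questions and their tag sets, invented for illustration
questions = [
    "How do I merge two dataframes in pandas?",
    "What is a pointer in C?",
    "How to plot a dataframe column with matplotlib?",
    "Segmentation fault when dereferencing a pointer in C",
]
tags = [["python", "pandas"], ["c"], ["python", "matplotlib"], ["c"]]

# Turn each tag set into a binary indicator row (one column per tag)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# Vectorize the question text with TF-IDF
vec = TfidfVectorizer()
X = vec.fit_transform(questions)

# One-vs-rest trains an independent binary classifier per tag
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# Predict tag indicators for a new question
pred = clf.predict(vec.transform(["How do I filter rows in a pandas dataframe?"]))
print(mlb.inverse_transform(pred))
```

The key difference from ordinary classification is that `MultiLabelBinarizer` allows any number of tags per example; the Keras version in the tutorial replaces the per-tag logistic regressions with a single network using a sigmoid output per label.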

See also: Text Classification in Keras (Part 2).

7. Sentence to Sentence Semantic Similarity – Tutorial (gensim), Video Tutorial

The semantic similarity of two sentences is a measure of how close their meanings are; in other words, it measures whether the sentences express the same intent. This video tutorial on finding the semantic similarity between two sentences uses the spaCy module in Python.

Full Code for finding semantic similarity between sentences using spaCy in Python:

# Python code to measure similarity between two sentences using cosine similarity. 
import spacy

# Load a model with word vectors; the bare "en" shorthand is deprecated in spaCy 3.x
nlp = spacy.load("en_core_web_md")

# Sentences
s1 = nlp("The weather is rainy.")
s2 = nlp("It is going to rain outside.")

# Calculate the similarity
print("The similarity is:",s1.similarity(s2))

8. Inference-based Chatbot system – Tutorial, Github Code, Video Tutorial

Chatbots are on the rise across the hospitality and service industries, whether as assistants on your device or as virtual waiters at a restaurant. In addition, intelligent chatbots are increasingly adopted for customer service in several sectors. Common examples include Siri, Alexa, and Google Assistant, which serve as both voice assistants and chatbots.

This tutorial series by Sentdex uses Tensorflow to create a conversational chatbot in Python.
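The Sentdex series builds a full sequence-to-sequence model in TensorFlow; as a much simpler starting point, here is a minimal retrieval-based chatbot that matches the user's input against a small set of known questions using TF-IDF cosine similarity (the question-answer pairs and the 0.2 threshold are illustrative assumptions, not from the tutorial):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy question-answer pairs invented for illustration
pairs = {
    "hello": "Hi there! How can I help you?",
    "what is your name": "I am a demo chatbot.",
    "what is nlp": "NLP stands for natural language processing.",
    "bye": "Goodbye!",
}

# Vectorize the known questions once with TF-IDF
questions = list(pairs)
vec = TfidfVectorizer()
Q = vec.fit_transform(questions)

def reply(user_input):
    """Return the answer whose question is most similar to the input."""
    sims = cosine_similarity(vec.transform([user_input]), Q)[0]
    best = sims.argmax()
    if sims[best] < 0.2:  # arbitrary cutoff for "no good match"
        return "Sorry, I don't understand."
    return pairs[questions[best]]

print(reply("hello"))  # Hi there! How can I help you?
```

A retrieval bot can only return canned answers; the inference-based seq2seq approach in the tutorial instead generates new responses word by word, which is what makes it far more flexible (and far harder to train).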

In Conclusion

How many of the above projects have you tried? Do you have any recommendations for us to include in the above list? Let us know.

Also, if you are trying to start or advance your career in the field of Computer Vision, you might like this article on “Open-Source Computer Vision Projects (With Tutorials)“.





