How to extract keywords from text with TF-IDF and Python’s Scikit-Learn
--
Back in 2006, when I had to use TF-IDF for keyword extraction in Java, I ended up writing all of the code from scratch. Neither data science nor GitHub was a thing back then, and libraries were limited.
The world is much different today. Several libraries and open-source repositories on GitHub provide a decent implementation of TF-IDF. If you don’t need a lot of control over how the TF-IDF math is computed, I highly recommend reusing libraries from known packages such as Spark’s MLlib or Python’s scikit-learn.
The one problem that I noticed with these libraries is that they are meant as a pre-step for other tasks like clustering, topic modeling, and text classification. But TF-IDF can also be used on its own to extract important keywords from a document and get a sense of what characterizes it. For example, if you are dealing with Wikipedia articles, you can use TF-IDF to extract words that are unique to a given article. These keywords can serve as a very simple summary of a document, and for text analytics when we look at them in aggregate.
In this article, I will show you how you can use scikit-learn to extract keywords from documents using TF-IDF. We will specifically do this on a Stack Overflow dataset. If you want access to the full Jupyter Notebook, please head over to my repo.
Important note: I’m assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with the concept before reading on. There are a couple of videos online that give an intuitive explanation of what it is. For a more academic explanation, I would recommend my Ph.D. advisor’s explanation.
Dataset
In this example, we will be using a Stack Overflow dataset, which is a bit noisy and simulates what you could be dealing with in real life. You can find this dataset in my tutorial repo.
Notice that there are two files. The larger file, stackoverflow-data-idf.json, with 20,000 posts, is used to compute the Inverse Document Frequency (IDF). The smaller file, stackoverflow-test.json, with 500 posts, is used as a test set for us to extract keywords from. This dataset is based…