Understanding LexRank Text Summarization Algorithm
Hi, In this blog I will try to explain the LexRank Text Summarization Algorithm in a step-wise and simplified manner.
LexRank Algorithm is an Extractive-based text summarization method. In the LexRank algorithm, a sentence that is similar to many other sentences of the text has a high probability of being important. The approach is that a particular sentence is recommended by other similar sentences and hence is ranked higher. This method is based on Eigen Vector centrally. It follows an approach of a connected graph. Sentences are placed on vertices of the graph. The weight on the Edges is calculated using a cosine similarity metric.
Let's Get Started With The Algorithm…
In order to explain the algorithm clearly, let's take a sample sentence that we will summarize:-
Before moving to apply the algorithm we will have to pre-process the text data. That is we have to remove stop words, noisy data(i.e. remove words like and, is, a, etc..). Tokenize sentences. After that, we are good to proceed with applying our algorithm.
Step 1:
Calculate Tf (Term Frequency):
To get term frequency we first calculate word frequency (count number of occurrence of the word) in each sentence.
Then we divide word freq. with the total number of words in each sentence to get TF for each word.
Step 2:
Calculate IDF (Inverse Document Frequency):
In order to calculate IDF, we will have to first maintain a count of sentences having a particular word(i.e. Count sentences having a particular word).
Now we find IDF by -> Log(total no. of sentences / no. of sentences have the word (w))
So, now we have calculated TF and IDF.
Step 3:
Assume a graph of sentences as a 2D Matrix of |s|x|s| (where |s| is no. of sentences..here we will have a 4x4 matrix as total sentences are 4) and matrix[i][j] will represent weights between the sentence i and j.
Step 4:
Calculate weights of the graph (that is matrix[i][j] values, matix[0][0] denotes weight between sentence 1 and 1, similarly matrix[0][1] denotes weight between sentence 1 and 2 and so on…as matrix indexing starts with 0).
We use the cosine similarity formula to calculate weight -
The above formula can be broken into 2 parts- numerator(n) and denominator which can be further broken into 2 parts denominator 1(d1) and 2(d2)
n = sum of (tf(s[i])[word]*tf(s[j])[word] * idf[word]^2) for word in common-words of (s[i] & s[j])
[*Note to calculate n(numerator) we take tf and idf for only those word which are common between sentence i and j, for example between sentence 1 and 2 we take tf and idf of word like Elizabeth, party as these words are common in sentences 1 and 2]
d1 = squareRoot of (sum of [(tf(s[i])[word]*idf[word])^2] for word in s[i])
d2 = squareRoot of (sum of [(tf(s[j])[word]*idf[word])^2] for word in s[j])
[*Note to calculate d1 & d2 (denominators) we take tf and idf’s of all words in a particular sentences. In d1 we take tf and idf of all words in first sentence and for d2 we take tf and idf of all words in second sentence.
And, If any denominator is less than or equal to zero we replace matrix[i][j] with 0.0 .]
(tf(s[i])[word] means term frequency(tf) of a word in sentence [i=1,2,..]. Similarly (tf(s[j]))[word] means term frequency(tf) of word in sentence [j=1,2,..]. And idf[word] means inverse document frequency(idf) of a word. The values for which we can get from the calculated tf and idf in Step 1 and 2.
Therefore let's see the calculation for weight for sentences 1 and 2 that is for matrix[0][1] to understand the calculation clearly-
So after calculating weights, our final matrix will be-
Step 5:
Compare matrix weights with threshold=0.1 and maintain degree for each sentence.
We now compare our matrix with a threshold(=0.1) and maintain a degree count. We can take degree count in another list of length |s| (no. of sentences).
Where ever matrix[i][j] is greater than the threshold we replace the matrix[i][j] value with 1 and increase the degree[i] count by 1. Else we replace matrix[i][j] value by 0.
So now our matrix and the degree list will be-
Step 6:
We now replace our matrix[i][j] value with respect to degree values.
Where ever degree is 0 we replace it by 1. And then compute matrix value as matrix[i][j] / degree[i].
So now our Final matrix will be-
Step 7:
We now score our sentences. In order to find the sentences that will constitute our summary.
In order to score sentences we calculate Centrality:-
[*Note here B^T is the transpose of the matrix (which is rows is replaced with columns) we calculated earlier and P is an array of length |s| (no. of sentences) each having value [1 / |s|](here |s|=no. of sentences=4, hence 1/4=0.25)]
Therefore our Final p will be-
Step 8:
Find an average score to select sentences that will constitute our summary.
So average from our Final p is-
Step 9:
Display Summary.
We now display sentences that have Final p[i] ≥ average.
As here all values of Final P[i] ≥ average i.e. 0.25, So, here all the sentences will constitute in our summary.
Our Final Summary is-
Now we have understood how LexRank Algorithm is working. Now try out this algorithm on the different text data and analyze the summarized text.
To read the research paper for LexRank refer- https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
To get the full code of the above algorithm refer- https://colab.research.google.com/drive/1Ars4w7YmtoqdCdaPUTnC0g3Vqcfd0jah?usp=sharing
I hope that we understood LexRank Algorithm well. And I was able to put my points clearly.
THANK YOU.