TextAnalyzer – Automatically Extract Characteristic Words

TextAnalyzer is a tool that finds the words that are characteristic of a given input file. It does not depend on any particular language, and it even seems to work well with HTML files.

This program is only a small prototype which shows that the technique seems to work. It’s public domain, so feel free to do whatever you like with it:

Download

textanalyze.rb, Licence: Public Domain.

Example

  1. Build an index from a reasonably large amount of data; it should be much larger than the text you want to analyze. For example, I indexed 76 of Grimm’s fairy tales with this command:

    cat *.txt | ruby ../textanalyze.rb c

    This creates the file wordcount.dat, which contains the count of each word.

  2. To find out which words are characteristic of a specific text, the previously generated reference data is used. To continue the example:

    cat LittleRedRidingHood.txt | ruby ../textanalyze.rb a

    This produces the output:

    hood, grandma, riding, hunter, red

    So the above words seem to be very relevant to LittleRedRidingHood.txt when compared to all of Grimm’s tales.
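
Conceptually, the index built in step 1 is just a table of word counts. The following is a minimal sketch of that counting step, not the code from textanalyze.rb: the method name count_words and the tokenizing regular expression are assumptions made for illustration, and the real format of wordcount.dat is not shown here.

```ruby
# Hypothetical sketch, not the code from textanalyze.rb.
# Counts how often each word occurs in a text.
def count_words(text)
  counts = Hash.new(0)
  # Lowercase the text and treat every run of letters or apostrophes as one word.
  text.downcase.scan(/[[:alpha:]']+/) { |word| counts[word] += 1 }
  counts
end

index = count_words("The wolf ate the grandma. The hunter found the wolf.")
puts index["the"]   # 4
puts index["wolf"]  # 2
```

The same counting can be applied both to the large reference collection and to the single text to be analyzed; the two resulting tables are what the scoring step compares.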

Other Uses

The previous example may seem a bit useless, but there are certainly plenty of useful applications. Here are some ideas:

  • Quickly find out what an unknown text is about
  • Automatically extract important words from blog entries
  • Find out what a text is about by reading just 5 words
  • Automatically create very short descriptions for a large number of documents

The currently implemented algorithm even works well with HTML files (to my own surprise; actually, I am surprised that it works at all…).

Algorithm

The main idea is quite simple: the algorithm assumes that important words are:

  1. Often used in the to-be-analyzed text
  2. Seldom used in other texts

For example, the second condition ensures that words like “the”, “and” etc. are not considered important.

The score of a word (higher means more important) is calculated with this formula:

tanh(curVal/curWords*200) - 5*tanh((allVal-curVal)/(allWords-curWords)*200)

The variables:

  • curVal: How often the word to score is present in the to-be-analyzed text.
  • curWords: Total number of words in the to-be-analyzed text.
  • allVal: How often the word to score is present in the indexed dataset.
  • allWords: Total number of words of the indexed dataset.
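
Written out as Ruby, the formula might look like the sketch below. The method and parameter names are chosen here for illustration and are not taken from textanalyze.rb. The sketch also assumes that the indexed dataset includes the analyzed text (which is what subtracting curVal and curWords suggests) and that it is larger than the analyzed text, so the second divisor is not zero. The word counts in the usage lines are made up.

```ruby
# Sketch of the scoring formula; names are illustrative, not from textanalyze.rb.
# Assumes all_words > cur_words and that all_val already includes cur_val.
def score(cur_val, cur_words, all_val, all_words)
  # Relative frequency in the analyzed text (to_f avoids integer division).
  cur_freq = cur_val.to_f / cur_words
  # Relative frequency in the rest of the indexed data.
  other_freq = (all_val - cur_val).to_f / (all_words - cur_words)
  Math.tanh(cur_freq * 200) - 5 * Math.tanh(other_freq * 200)
end

# Made-up counts: a word that appears only in the analyzed text ...
puts score(10, 1000, 10, 100_000).round(2)    # about 0.96
# ... and a word that is equally common everywhere.
puts score(60, 1000, 6000, 100_000).round(2)  # about -4.0
```

With these numbers, the word that is frequent in the analyzed text but absent elsewhere gets a positive score, while the word that is common everywhere is pushed far below zero by the second term and its factor of 5.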

Please don’t ask me how or why this works. I have no idea. I invented this formula in one of the rare moments when I was enlightened for approximately 10 seconds, quickly wrote it down, and immediately forgot how it worked because my mind was overwhelmed by its beauty and simplicity… Or something like that ;-)

  • Pingback: TextAnalyzer in Python — Martin Ankerl

  • http://empirecollective.co.uk Christopher de Beer

    hi,

    found this post recently via your contribution on Stack Overflow ( http://stackoverflow.com/questions/5322317/what-is-a-good-approach-for-extracting-keywords-from-user-submitted-text ), but just to inform you and other would-be visitors of the link you posted in the comment above: that site exists, although it is not maintained and no longer seems to have the Python port of the TextAnalyzer :)

    Just saying.

  • Attilio Altieri

    Hi! I was trying to run your code but I get an error:

    cat Texts/SmallTexts/My file.txt | ruby textanalyze.rb a
    loading…
    1398 words loaded

    Enter Filename:
    textanalyze.rb:69:in `initialize': No such file or directory - ABSTRACT (Errno::ENOENT)
    from textanalyze.rb:69:in `open'
    from textanalyze.rb:69:in `compareIndex'
    from textanalyze.rb:127:in `'

    I wrote a small program similar to yours and I wanted to check your code to improve keyword detection.

    • http://martin.ankerl.com/ Martin Ankerl

      The API of my code is pretty bad… First you need to create an index of common words with

      cat *.txt | ruby textanalyze.rb c

      then you analyze a text with

      ruby textanalyze.rb a

      and then enter the name of the file you want to have analyzed.

      • Attilio Altieri

        I couldn’t get your code to work, so I decided to use only your formula in my code. After some failures I discovered that when you divide two integers in Ruby, the result is an integer (!) ( http://mysite.verizon.net/hpassel/thinkruby/book/ch03.html ). So in your code you should add “.to_f” every time there’s a division. I was always getting 0 before.

        • http://martin.ankerl.com/ Martin Ankerl

          In my code all the values are already floats; I do this in line 90, where I multiply each value by 1.0, which converts it to a float as well. But I agree that to_f would have been nicer.
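
The integer-division pitfall discussed in this thread can be reproduced in a few lines. This is a minimal sketch with made-up counts and illustrative variable names; it is not taken from textanalyze.rb.

```ruby
cur_val = 3       # made-up counts for illustration
cur_words = 1000

puts cur_val / cur_words         # 0      (Integer / Integer truncates)
puts cur_val.to_f / cur_words    # 0.003  (one Float operand is enough)
puts cur_val * 1.0 / cur_words   # 0.003  (multiplying by 1.0 also yields a Float)
```

Either conversion works; what matters is that at least one operand is a Float before the division happens.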