Language Statistical Analysis

For one of my IT-based courses, I was required to do a project involving computers and language. The requirements were very vague, because the course is aimed at people with little to no IT background (and, as it wasn’t a programming course, even more so!). I decided to enjoy it, and therefore to write some code.

My initial project was to take various texts from three different time periods (all in English) and attempt to find how words vary through time. The main challenge would be finding an algorithm to decide how one word becomes another. A friend told me he had a C++ library that would be suitable for this, so I attempted some OO programming (it did not go well). As it turned out, the library he had in mind was useless to me, so I decided to drop the OO approach, stick to plain procedural code, and write everything in Perl.

This was the first time I had worked with Perl (my only previous experience had been adding “echo” functions to existing scripts), so first I had to decide what I wanted to do and what I could do:

  1. Take the text from an input file and make everything lowercase.
  2. Create a table containing each word, the number of times it appears and other statistical elements.

Coming from Bash, I did not at first see how to build a table dynamically in Perl, but hashes solve this. I therefore ended up with a hash keyed by word, with the number of occurrences as the value: word => occurrences.
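
The two steps and the word-count hash described above could look roughly like the sketch below. It is not my original script: the name count_words, the sample sentence and the regex used to decide what counts as a word are all assumptions made for illustration, and the text comes from a string rather than an input file to keep the example self-contained.

```perl
use strict;
use warnings;

# Count how often each word occurs in a piece of text.
# Returns a reference to a hash of the form { word => occurrences }.
# (count_words is a hypothetical name, not taken from the original script.)
sub count_words {
    my ($text) = @_;
    my %count;

    # Step 1: make everything lowercase.
    my $lower = lc $text;

    # Step 2: tally each word. Assumption: a "word" is a run of ASCII
    # letters, optionally with internal apostrophes (as in "don't").
    for my $word ($lower =~ /[a-z]+(?:'[a-z]+)*/g) {
        $count{$word}++;
    }
    return \%count;
}

my $counts = count_words("The cat saw the other cat. The end.");

# Print the table, most frequent word first, ties broken alphabetically.
for my $word (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    printf "%-6s %d\n", $word, $counts->{$word};
}
```

Incrementing $count{$word} creates the entry the first time a word is seen, which is what makes the table "dynamic": nothing needs to be declared in advance.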

The next step was to use the numbers. At first I attempted to write my own subroutines, but when it took me an hour to write one simple function, I decided to look for packages instead. To my surprise, CPAN had a discrete statistics package. The joy! I no longer needed to learn how to do OO (the package did everything for me), and I had no need to write subroutines, as the package gave me all the tools necessary.
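
To give an idea of the kind of numbers involved, here is a minimal hand-rolled sketch that summarises the occurrence counts using only the core List::Util module. It is deliberately not the CPAN package's API, which I am not reproducing here; the name summarise, the returned keys and the sample counts are assumptions for illustration only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Summarise a list of numbers (here: per-word occurrence counts).
# Hypothetical stand-in for what a statistics package would provide.
sub summarise {
    my @sorted = sort { $a <=> $b } @_;
    my $n      = @sorted;               # number of values
    die "no data\n" unless $n;

    my $mean = sum(@sorted) / $n;
    my $mid  = int($n / 2);
    # Median: middle value, or the average of the two middle values.
    my $median = $n % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

    return {
        count  => $n,
        min    => $sorted[0],
        max    => $sorted[-1],
        mean   => $mean,
        median => $median,
    };
}

# Made-up sample data in the word => occurrences shape.
my %count = (the => 3, cat => 2, saw => 1, other => 1, end => 1);
my $stats = summarise(values %count);

printf "%-6s %s\n", $_, $stats->{$_} for qw(count min max mean median);
```

In practice a ready-made package saves writing and debugging exactly this sort of routine, which is why I stopped writing my own.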

After some 20 hours of coding (for only 100 lines of code), my conclusion was that the language’s statistics don’t change significantly over time, and that the main changes are:

  • New vocabulary replacing old (computer vs typewriter).
  • Manner of communication (Twitter will have a narrower vocabulary range than Shakespeare’s works, when comparing type-token ratios).
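
For reference, the type-token ratio is the number of distinct words (types) divided by the total number of words (tokens). A minimal sketch of the calculation, assuming the same simple definition of a word as a run of letters, with a hypothetical function name and sample sentence:

```perl
use strict;
use warnings;

# Type-token ratio: distinct words (types) / total words (tokens).
# (type_token_ratio is a hypothetical name used for illustration.)
sub type_token_ratio {
    my ($text) = @_;
    my $lower  = lc $text;

    # Assumption: a token is a run of ASCII letters, optionally with
    # internal apostrophes.
    my @tokens = $lower =~ /[a-z]+(?:'[a-z]+)*/g;
    return 0 unless @tokens;            # avoid dividing by zero

    my %types;
    $types{$_} = 1 for @tokens;         # hash keys are the distinct words

    my $n_types  = keys %types;
    my $n_tokens = @tokens;
    return $n_types / $n_tokens;
}

# 8 tokens, 5 distinct words
printf "%.3f\n", type_token_ratio("The cat saw the other cat. The end.");
```

A ratio close to 1 means almost every word is new; a low ratio means a lot of repetition. The ratio tends to fall as a text gets longer, so it is most meaningful when the samples being compared are of similar length.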

My Perl verdict: definitely a fun language, although it requires a large amount of commenting because of how freely it lets you use variables and build regexes. I will definitely continue to explore this language.

Much help obtained from PerlMonks