A Lexical Analysis of Obama's State of the Union Speech

By Peter Bromberg

Each year I like to play with the text of the president's State of the Union speech to compute its word frequencies. You can learn a fair amount from how often certain words are used. This year I did it with LINQ as a fun programming exercise.

For this exercise, I put together an extension method on the String class called GetWordFrequency. This allows us to call the method directly on the string of text that comprises the president's speech.

Here is the method:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace SOTU
{
    public static class CustomExtensions
    {

         public static string[] stopwords = {
                                               "a",
                                               "about",
                                               "above",
                                               "across",
                                               "after",
                                               "again",
                                               "against",
                                               "all",
                                               "almost",
                                               "alone",
                                               "along",
                                               "already",
                                               "also",
                                               "although",
                                               "always"
                                                    // rest of list abbreviated - full list in code sample in the download
                                           };
        /// <summary>
        /// Analyze word frequency for a given string.
        /// </summary>
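        public static Dictionary<string, int> GetWordFrequency(this string input)
        {
            return input
                // Split the text into tokens on spaces.
                .Split(new char[] { ' ' })
                // Drop empty tokens and tokens with no word characters.
                .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
                // Strip trailing punctuation and convert to lowercase.
                .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
                // Discard anything in the stopwords list.
                .Where(x => !stopwords.Contains(x))
                // Count each distinct word, most frequent first.
                .GroupBy(w => w)
                .OrderByDescending(group => group.Count())
                .ToDictionary(group => group.Key, group => group.Count());
        }
    }
}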

We start out with a string array of stopwords. These are common words like "a", "and", "the" and so on, which we're not really interested in.

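Then I construct a LINQ query that splits the string into words on spaces, drops empty tokens and tokens that contain no word characters, strips trailing punctuation, converts to lowercase and finally throws away anything that's in the stopwords list. The remaining words are grouped, counted and ordered by descending count.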
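To make the stages of that query easier to follow, here is a minimal sketch that runs the first three steps separately on one short sentence. The class name PipelineTrace, its helper methods, the sample sentence and the four-word stopword list are all assumptions made up for illustration; only the Split, Where, Select and Regex calls mirror the method above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipelineTrace
{
    // Tiny stand-in for the article's stopword list (assumption, illustration only).
    static readonly string[] SampleStopwords = { "the", "and", "we", "for" };

    // Stage 1: split on spaces and keep only tokens containing a word character.
    public static List<string> Tokenize(string input)
    {
        return input.Split(new char[] { ' ' })
            .Where(t => t.Trim() != String.Empty && Regex.IsMatch(t, @"\w"))
            .ToList();
    }

    // Stage 2: strip trailing non-alphanumerics and lowercase each token.
    public static List<string> Normalize(IEnumerable<string> tokens)
    {
        return tokens
            .Select(t => Regex.Replace(t, @"[^A-Za-z0-9]+$", "").ToLower())
            .ToList();
    }

    // Stage 3: drop anything found in the stopword list.
    public static List<string> RemoveStopwords(IEnumerable<string> words)
    {
        return words.Where(w => !SampleStopwords.Contains(w)).ToList();
    }

    public static void Main()
    {
        string text = "We need jobs, jobs and energy -- for the people.";
        List<string> tokens = Tokenize(text);
        List<string> normalized = Normalize(tokens);
        List<string> kept = RemoveStopwords(normalized);

        // We | need | jobs, | jobs | and | energy | for | the | people.
        Console.WriteLine(string.Join(" | ", tokens));
        // we | need | jobs | jobs | and | energy | for | the | people
        Console.WriteLine(string.Join(" | ", normalized));
        // need | jobs | jobs | energy | people
        Console.WriteLine(string.Join(" | ", kept));
    }
}
```

Note that the "--" token disappears in the first stage because it contains no word character, and that only trailing punctuation is stripped, so a leading quote mark would stay attached to its word.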

The result is returned to the caller, which can then display it or, as in this case, also save it to a file for further analysis.
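As a rough sketch of what the calling side might look like, the following program prints the top entries as "word,count" lines and writes the same lines to a file. The stopword list is shortened, the input is a stand-in sentence rather than the speech, and the names ToCsvLines, SaveDemo and wordfrequency.csv are hypothetical; the downloadable sample may do this differently. The lines are re-sorted before output because the documentation for Dictionary does not define its enumeration order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class FrequencyExtensions
{
    // Shortened stand-in for the full stopword list (assumption).
    public static string[] stopwords = { "a", "about", "all", "also", "and", "the", "we" };

    // Same query as in the article, repeated so this sketch compiles on its own.
    public static Dictionary<string, int> GetWordFrequency(this string input)
    {
        return input
            .Split(new char[] { ' ' })
            .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
            .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
            .Where(x => !stopwords.Contains(x))
            .GroupBy(w => w)
            .OrderByDescending(group => group.Count())
            .ToDictionary(group => group.Key, group => group.Count());
    }

    // Hypothetical helper: format the top entries as "word,count" lines,
    // sorting explicitly rather than relying on dictionary order.
    public static List<string> ToCsvLines(this Dictionary<string, int> frequencies, int top)
    {
        return frequencies
            .OrderByDescending(pair => pair.Value)
            .Take(top)
            .Select(pair => pair.Key + "," + pair.Value)
            .ToList();
    }
}

public static class SaveDemo
{
    public static void Main()
    {
        // Stand-in text; the real input would be the full text of the speech.
        string speech = "We built jobs and energy. Jobs, jobs and the economy!";
        List<string> lines = speech.GetWordFrequency().ToCsvLines(40);

        // Display the result.
        foreach (string line in lines)
            Console.WriteLine(line);

        // Save the same lines; "wordfrequency.csv" is a hypothetical file name.
        File.WriteAllLines("wordfrequency.csv", lines);
    }
}
```

With the stand-in sentence, the first line written is jobs,3, followed by the three words that occur once.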

Here are the 40 most-used words from ObamaSpeak:

american,33
jobs,28
america,27
energy,23
tax,23
people,20
americans,18
country,17
congress,15
world,14
help,14
businesses,13
don't,12
economy,12
built,12
you're,12
million,11
tonight,11
i'm,11
workers,11
business,11
companies,11
pay,11
financial,10
oil,10
home,9
rules,9
debt,9
industry,9
job,9
gas,9
clean,9
own,8
stop,8
nearly,8
let's,8
taxes,8
education,8
power,8
government,8

My original method had a secondary loop, but thanks to a suggestion by fellow MVP Chris Eargle, the above code is even more efficient.

You can download the sample solution, which includes the text of the speech, here.

Biography - Peter Bromberg
Peter Bromberg is a C# MVP, MCP, and .NET expert who has worked in banking, finance, and telephony for over 20 years. Pete focuses exclusively on the .NET Platform, and currently develops SOA and other .NET applications for a Fortune 500 clientele. Peter enjoys producing digital photo collages with Maya, playing jazz flute, the beach, and fine wines. You can view Peter's UnBlog and IttyUrl sites.
Article Discussion: A lexical Analysis of Obama's State of the Union Speech
Peter Bromberg posted at Wednesday, January 25, 2012 11:59 AM
Kenneth NONE replied to Peter Bromberg at Wednesday, January 25, 2012 12:44 PM
Why not use the overload to Split and pass StringSplitOptions.RemoveEmptyEntries for the second parameter? Then you can ditch the subsequent Where.
Peter Bromberg replied to Kenneth NONE at Wednesday, January 25, 2012 12:44 PM
Excellent idea. Probably more efficient, too.
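For reference, here is a minimal sketch of what Kenneth's suggestion might look like, using the Split(char[], StringSplitOptions) overload. The class name SplitVariant, the method name GetWordFrequencyNoEmpties and the short stopword list are assumptions for illustration, not code from the download. The sketch removes only the empty-string test; it keeps the \w test, since dropping the whole Where would let punctuation-only tokens such as "--" through, and these would then be counted as an empty word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVariant
{
    // Shortened stand-in for the full stopword list (assumption).
    static readonly string[] SampleStopwords = { "a", "and", "the" };

    // Hypothetical variant: Split itself discards the empty tokens
    // produced by consecutive spaces.
    public static Dictionary<string, int> GetWordFrequencyNoEmpties(this string input)
    {
        return input
            .Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // No empty-string test needed; still reject tokens with no word characters.
            .Where(i => Regex.IsMatch(i, @"\w"))
            .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
            .Where(x => !SampleStopwords.Contains(x))
            .GroupBy(w => w)
            .OrderByDescending(group => group.Count())
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        // Double spaces would yield empty tokens without RemoveEmptyEntries.
        string text = "the  economy -- and  jobs jobs";
        foreach (KeyValuePair<string, int> pair in text.GetWordFrequencyNoEmpties())
            Console.WriteLine(pair.Key + "," + pair.Value);
    }
}
```

For the sample string this yields two entries, jobs with a count of 2 and economy with a count of 1.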