Java Notes: Example - WordFrequency

This program reads files of words and lists their frequencies. A file of words to ignore can also be supplied. This example illustrates the use of generic data structures (Sets, Maps, and ArrayList), regular expressions (as used by Scanner), and Comparators.

Exercises. A command-line interface is provided, but an interesting exercise is to write a GUI interface.

A text command-line interface

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
 28 
 29 
 30 
 31 
 32 
 33 
 34 
 35 
 36 
 37 
 38 
 39 
 40 
 41 
 42 
 43 
 44 
 45 
 46 
// File   : data-collections/wordfreq2/WordFrequencyCmd.java
// Purpose: Main program to test WordCounter
//          Prints word frequency in source file. Ignores words in ignore file.
//          Uses Sets, Maps, ArrayList, regular expressions, BufferedReader.
// Author : Fred Swartz - April 2007 - Placed in public domain.

import java.io.*;
import java.util.*;

/////////////////////////////////////////////////////////////// WordFrequencyCmd
class WordFrequencyCmd {
    //===================================================================== main
    public static void main(String[] unused) {
        Scanner in = new Scanner(System.in);
        
        try {
            //... Read two file names from the input.
            System.out.println("Name of file containing text to analyze:");
            File inputFile = new File(in.nextLine());
            
            System.out.println("Name of file containing words to ignore:");
            File ignoreFile = new File(in.nextLine());
            
            //... Supply two files to WordCounter.
            WordCounter counter = new WordCounter();
            counter.ignore(ignoreFile);
            counter.countWords(inputFile);
            
            //... Get the results.
            String[] wrds   = counter.getWords(WordCounter.SortOrder.BY_FREQUENCY);
            int[] frequency = counter.getFrequencies(WordCounter.SortOrder.BY_FREQUENCY);
            
            //... Display the results.
            int n = counter.getEntryCount();
            for (int i=0; i<n; i++) {
                System.out.println(frequency[i] + " " + wrds[i]);
            }
            
            System.out.println("\nNumber of input words: " + counter.getWordCount());
            System.out.println("Number of unique words: " + n);
            
        } catch (IOException iox) {
            System.out.println(iox);
        }
    }
}

The "Model" -- Word frequency without any user interface

The model (the logic of the program without any user interface) is implemented primarily in the WordCounter class, which uses utility classes: ComparatorFrequency and ComparatorAlphabetic to sort, and Int to record the frequency counts.

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
 28 
 29 
 30 
 31 
 32 
 33 
 34 
 35 
 36 
 37 
 38 
 39 
 40 
 41 
 42 
 43 
 44 
 45 
 46 
 47 
 48 
 49 
 50 
 51 
 52 
 53 
 54 
 55 
 56 
 57 
 58 
 59 
 60 
 61 
 62 
 63 
 64 
 65 
 66 
 67 
 68 
 69 
 70 
 71 
 72 
 73 
 74 
 75 
 76 
 77 
 78 
 79 
 80 
 81 
 82 
 83 
 84 
 85 
 86 
 87 
 88 
 89 
 90 
 91 
 92 
 93 
 94 
 95 
 96 
 97 
 98 
 99 
100 
101 
102 
103 
104 
105 
106 
107 
108 
109 
110 
111 
112 
113 
114 
115 
116 
117 
118 
119 
120 
121 
122 
123 
124 
125 
126 
127 
128 
129 
130 
131 
132 
133 
134 
135 
136 
137 
138 
139 
140 
141 
142 
143 
144 
145 
146 
147 
148 
149 
150 
151 
152 
153 
154 
155 
156 
157 
158 
159 
160 
161 
162 
163 
164 
165 
166 
167 
168 
169 
170 
171 
172 
173 
174 
175 
176 
177 
178 
179 
180 
181 
182 
183 
184 
185 
186 
187 
188 
189 
190 
191 
192 
193 
194 
195 
196 
197 
198 
199 
200 
201 
202 
203 
204 
// File   : data-collections/wordfreq2/WordCounter.java
// Purpose: Provides methods to read ingore file, input file, get results.
//          Computes the frequency for each word.
// Author : Fred Swartz - April 2007 - Placed in public domain.

import java.io.*;
import java.util.*;
import java.util.regex.*;

/** Computes word frequency in source file; ignores words in ignore file.
 * Uses generic Sets, Maps, ArrayLists, regular expressions, Scanner.
 * @author Fred Swartz
 * @version 2007-05-06
 */
public class WordCounter {
    //================================================================ constants
    private static final Comparator<Map.Entry<String, Int>> SORT_BY_FREQUENCY = 
            new ComparatorFrequency();
    private static final Comparator<Map.Entry<String, Int>> SORT_ALPHABETICALLY = 
            new ComparatorAlphabetic();
    public enum SortOrder {ALPHABETICALLY, BY_FREQUENCY}
    
    //=================================================================== fields
    Set<String>      _ignoreWords;   // Words to ignore.
    Map<String, Int> _wordFrequency; // Words -> frequency
    int              _totalWords;    // Total source words.
    
    //============================================================== constructor
    /** Constructor */
    public WordCounter() {
        _ignoreWords   = new HashSet<String>();
        _wordFrequency = new HashMap<String, Int>();
        _totalWords    = 0;
    }
    
    //=================================================================== ignore
    /**
     * Reads file of words to ignore. Ignore words are added to a Set.
     *  The IOException is passed to caller because we certinaly don't
     *  know what the user interface issue is.
     * 
     * @param ignoreFile File of words to ignore.
     */
    public void ignore(File ignoreFile) throws IOException {
        Scanner ignoreScanner = new Scanner(ignoreFile);
        ignoreScanner.useDelimiter("[^A-Za-z]+");
        
        while (ignoreScanner.hasNext()) {
            _ignoreWords.add(ignoreScanner.next());
        }
        ignoreScanner.close();  // Close underlying file.
    }
    
    //=================================================================== ignore
    /**
     * Takes String of words to ignore. Ignore words are added to a Set.
     * 
     * @param ignore String of words to ignore.
     */
    public void ignore(String ignoreStr) {
        Scanner ignoreScanner = new Scanner(ignoreStr);
        ignoreScanner.useDelimiter("[^A-Za-z]+");
        
        while (ignoreScanner.hasNext()) {
            _ignoreWords.add(ignoreScanner.next());
        }
    }
    
    
    //=============================================================== countWords
    /** Record the frequency of words in the source file.
     *  May be called more than once.
     *  IOException is passed to caller, who might know what to do with it.
     *@param File of words to process.
     */
    public void countWords(File sourceFile) throws IOException {
        Scanner wordScanner = new Scanner(sourceFile);
        wordScanner.useDelimiter("[^A-Za-z]+");
        
        while (wordScanner.hasNext()) {
            String word = wordScanner.next();
            _totalWords++;
            
            //... Add word if not in map, otherewise increment count.
            if (!_ignoreWords.contains(word)) {
                Int count = _wordFrequency.get(word);
                if (count == null) {    // Create new entry with count of 1.
                    _wordFrequency.put(word, new Int(1));
                } else {                // Increment existing count by 1.
                    count.value++;
                }
            }
        }
        wordScanner.close();  // Close underlying file.
    }
    
    
    //=============================================================== countWords
    /** Record the frequency of words in a String.
     *  May be called more than once.
     *@param String of words to process.
     */
    public void countWords(String source) {
        Scanner wordScanner = new Scanner(source);
        wordScanner.useDelimiter("[^A-Za-z]+");
        
        while (wordScanner.hasNext()) {
            String word = wordScanner.next();
            _totalWords++;
            
            //... Add word if not in map, otherewise increment count.
            if (!_ignoreWords.contains(word)) {
                Int count = _wordFrequency.get(word);
                if (count == null) {    // Create new entry with count of 1.
                    _wordFrequency.put(word, new Int(1));
                } else {                // Increment existing count by 1.
                    count.value++;
                }
            }
        }
    }
    
    //============================================================= getWordCount
    /** Returns number of words in all source file(s).
     *@return Total number of words proccessed in all source files.
     */
    public int getWordCount() {
        return _totalWords;
    }
    
    //============================================================ getEntryCount
    /** Returns the number of unique, non-ignored words, in the source file(s).
     *  This number should be used to for the size of the arrays that are
     *  passed to getWordFrequency.
     *@return Number of unique non-ignored source words.
     */
    public int getEntryCount() {
        return _wordFrequency.size();
    }
    
    //========================================================= getWordFrequency
    /** Stores words and their corresponding frequencies in parallel array lists
     *  parameters.  The frequencies are sorted from low to high.
     * @param words Unique words that were found in the source file(s).
     * @param counts Frequency of words at corresponding index in words array.
     */
    public void getWordFrequency(ArrayList<String> out_words,
            ArrayList<Integer> out_counts) {
        //... Put in ArrayList so sort entries by frequency
        ArrayList<Map.Entry<String, Int>> entries =
                new ArrayList<Map.Entry<String, Int>>(_wordFrequency.entrySet());
        Collections.sort(entries, new ComparatorFrequency());
        
        //... Add word and frequency to parallel output ArrayLists.
        for (Map.Entry<String, Int> ent : entries) {
            out_words.add(ent.getKey());
            out_counts.add(ent.getValue().value);
        }
    }
    
    //================================================================= getWords
    /** Return array of unique words, in the order specified.
     * @return An array of the words in the currently selected order.
     */
    public String[] getWords(SortOrder sortBy) {
        String[] result = new String[_wordFrequency.size()];
        ArrayList<Map.Entry<String, Int>> entries =
                new ArrayList<Map.Entry<String, Int>>(_wordFrequency.entrySet());
        if (sortBy == SortOrder.ALPHABETICALLY) {
            Collections.sort(entries, SORT_ALPHABETICALLY);
        } else {
            Collections.sort(entries, SORT_BY_FREQUENCY);
        }
        
        //... Add words to the String array.
        int i = 0;
        for (Map.Entry<String, Int> ent : entries) {
            result[i++] = ent.getKey();
        }
        return result;
    }
    
    //=========================================================== getFrequencies
    /** Return array of frequencies, in the order specified.
     * @return An array of the frequencies in the specified order.
     */
    public int[] getFrequencies(SortOrder sortBy) {
        int[] result = new int[_wordFrequency.size()];
        ArrayList<Map.Entry<String, Int>> entries =
                new ArrayList<Map.Entry<String, Int>>(_wordFrequency.entrySet());
        if (sortBy == SortOrder.ALPHABETICALLY) {
            Collections.sort(entries, SORT_ALPHABETICALLY);
        } else {
            Collections.sort(entries, SORT_BY_FREQUENCY);
        }
        
        //... Add words to the String array.
        int i = 0;
        for (Map.Entry<String, Int> ent : entries) {
            result[i++] = ent.getValue().value;
        }
        return result;
    }
}

Comparators

The key-value (Map.Entry) pairs in a HashMap are not sorted, but users would certainly like to see them sorted, either alphabetically by word, or by frequency.

Because sorts are usually based on comparison of two values at a time, all that is needed is a way to compare two values. That's what a Comparator does -- it defines a compare() method that tells how two values compare.

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
// File   : data-collections/wordfreq2/ComparatorAlphabetic.java
// Purpose: A comparator to sort Map.Entries alphabetically.
// Author : Fred Swartz - March 2005 - Placed in public domain.

import java.util.*;

/////////////////////////////////////////////////////// class ComparatorAlphabetic
/** Order words alphabetically. */
class ComparatorAlphabetic implements Comparator<Map.Entry<String, Int>> {
    public int compare(Map.Entry<String, Int> entry1
                     , Map.Entry<String, Int> entry2) {
        return entry1.getKey().compareTo(entry2.getKey());
    }
}
  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
 22 
 23 
 24 
 25 
 26 
 27 
// File   : data-collections/wordfreq2/ComparatorFrequency.java
// Purpose: A comparator to sort Map.Entries by frequency.
// Author : Fred Swartz - April 2007 - Placed in public domain.

import java.util.*;

////////////////////////////////////////////////////// class ComparatorFrequency
/** Order words from least to most frequent, put ties in alphabetical order. */
class ComparatorFrequency implements Comparator<Map.Entry<String, Int>> {
    public int compare(Map.Entry<String, Int> obj1
                     , Map.Entry<String, Int> obj2) {
        int result;
        int count1 = obj1.getValue().value;
        int count2 = obj2.getValue().value;
        if (count1 < count2) {
            result = -1;
            
        } else if (count1 > count2) {
            result = 1;
            
        } else { 
            //... If counts are equal, compare keys alphabetically.
            result = obj1.getKey().compareTo(obj2.getKey());
        }
        return result;
    }
}

Mutable Integers

Because only objects (not primitives) can be stored in Collections data structures, the integer count must be an object, not just an int. The natural choice would be the Integer wrapper type, but it is immutable, which means that once an Integer object is created, its value can never be changed. But here we need to update the value, so here's a class that stores an int that can be changed.

  1 
  2 
  3 
  4 
  5 
  6 
  7 
  8 
  9 
 10 
 11 
 12 
 13 
 14 
 15 
 16 
 17 
 18 
 19 
 20 
 21 
// File   : data-collections/wordfreq2/Int.java
// Purpose: Simple value class to hold a mutable int.
// Author : Fred Swartz - March 2005 - Placed in public domain.

//////////////////////////////////////////////////////////////// value class Int
/** Utility class to keep int as Object but allow changes (unlike Integer).
 * Java collections hold only Objects, not primitives, but need to update value.
 * The intention is that the public field should be used directly.
 * For a simple value class this is appropriate.
 */
class Int {
    //=================================================================== fields
    public int value;  // Just a simple PUBLIC int.  
    
    //============================================================== constructor
    /** Constructor 
        @param value Initial value. */
    public Int(int value) { 
        this.value = value;  
    }
}