How to sort a list of words? The bucket sorting method

In the last post I said that I just started my MSc in health informs. The first semester has five modules, one of those modules is about object oriented programming.

The object oriented programming module starts with introduction to programming, mathematics in programming, flow charts and pseudo codes.

When I was going through past papers one question caught my eye, at a glance it looks like an easy question, but the more you look at it, the difficult it gets.

Write a pseudo code to input five names and arrange them in alphabetical order

Looks simple enough right? But what about names that start with the same letter? The differentiating letter can be the second, third or nth letter.

Even though the question asks for five names, a good algorithm should be able to tackle hundreds or thousands of names.

Approaches

Some of my colleagues suggested a method that was about brute forcing.

  • First check the first letter a word (n)
  • Then check the first letter of the second word (n+1)
  • If the first letter of the n+1 word is lesser than the nth word then swap them
  • Now check n against n+2
  • Continue until the first letter of a word is greater than that of the first letter of the nth word
  • Now repeat the same process for n+1, n+2 word and so on

Even though this approach can be good for small number of words, the complexity can increase faster with the addition of each word than the total number of words.

Trying the bucket sort approach

I had a hunch that there should be a better way, I knew on the back of my mind, that there has to be a better method. Plus I had to make a proof of concept working code for a different method.

After trying different approaches I came across this Wikipedia article about bucket sort, https://en.wikipedia.org/wiki/Bucket_sort

This was my approach

  • Generate 26 empty buckets, one bucket representing each letter of the alphabet
  • Start with first letter of each word
  • Loop though each word and assign a bucket according to the first letter of each word
  • Now loop through the 26 buckets, if the bucket contains only one word then add that to the output
  • If there is more than one word, then repeat the same process but now starting with the second letter of the word and so on

The pseudo code

BEGIN
  Get names as array
  Output sort(names, 0)

  Begin function sort(names, indent) {
      Set buckets = empty 26 arrays
      Set output 

      For every name in names
         Set char = char value of first letter-97
          Set bucket[char] = name
      End for

       For every bucket in buckets
          If bucket has only one value
             Add that to output
          Else if more than one
             Indent += 1
             Sort(bucket)
          End if
       End for
    Return output
  End function
END

Working code

I’ve made the simple example of this algorithm using JavaScript. Even though the module was taught in Java, since Java arrays are fixed in length, I found difficulties in implementing this code in Java.

const names = ["ruky" , "nethmi", "janith" , "rukmal", "sahan" , "rukshan"]

console.log(createBuckets(names, 0))
function createBuckets(names, indent) {
   const buckets = []   const output =  []
   var n = 0   while(n<=26){
     buckets[n] = []
     n++
   }
   for (var i = 0; i <= names.length-1; i++) {
      var char = names[i].charCodeAt(indent)-97
      buckets[char].push(names[i])
   }
   for (var i = 0; i <= buckets.length-1; i++) {
     if(buckets[i].length > 1) {
       const tempOutput = createBuckets(buckets[i],indent+1)
       for (var o=0; o <= tempOutput.length-1;o++) {
         if (tempOutput[o].length === 1) output.push(tempOutput[o])
       }
     } else if (buckets[i].length === 1) {
       output.push(buckets[i])
     }
   }
   return output
}

You can see the code in the following bin https://jsbin.com/jilotavocu/edit?js,console

Limitations

This algorithm has following limitations, it will throw an error when there duplicates.

I haven’t put a failsafe method to check for duplicates. But it’s fairly simple to implement, and I think the pseudo code is more than enough to get majority of the marks, and it’s also better than brute forcing. Also better to arrange large number of words.

Do you have a better implementation? I’d love to know.