Ken's Random Thoughts: 2018

Coding tests: That necessary part of the hiring process that can be the difference between being hired and not. For the All Matched Set of Parenthesis coding test, I was provided with the following information:
- Given a number n, print out all the possible syntactically valid n pairs of parenthesis.
- When n = 0, there are no answers.
- When n = 1, the one answer is: ()
- When n = 2, the two answers are: ()(), (())
- When n = 3, the five answers are: ()()(), (()()), ()(()), (())(), ((()))
- What is the N-time of your code?
- As a secondary exercise, make your code work for N in the millions if it does not to begin with.
- This should be completed on a whiteboard in 30 minutes.

Sounds simple enough. Generally, I like to avoid using recursive functions, as they tend to be hard to diagnose problems with. My original answer in java was:

HashSet results = new HashSet(1);
results.add("()");
for (int i = 1; i < pairs; ++i) {
    HashSet newResults = new HashSet(results.size() * 3);
    for (String s:results) {
        newResults.add("(" + s + ")");
        newResults.add("()" + s);
    }
    results = newResults;
}
for(String s:results) {
    System.out.println(s);
}

This answer is actually wrong. When I was given the n = 3 answers, I was only given 4 answers, without the (())() answer. I coded to that. I was not told this at the time and the interviewer even confirmed that I had the right answer as it matched the example answer given for n = 3.

The correct answer using that style in Java should have been:

HashSet results = new HashSet(1);
results.add("()");
for (int i = 1; i < pairs; ++i) {
    HashSet newResults = new HashSet(results.size() * 3);
    for (String s:results) {
        newResults.add("(" + s + ")");
        newResults.add("()" + s);
        newResults.add(s + "()");
    }
    results = newResults;
}
for(String s:results) {
    System.out.println(s);
}

But even this answer has issues, as it is going to be limited by memory due to storing the entire n results in memory as well as the n-1 results at the same time. When actually running real code (whiteboards as a rule do not run code) in a standard 64-bit Java environment, it runs out of memory with n = 17. Note that n = 4 should have fourteen answers.

The N-time for the correct code is between 2^(n-1) and 3^(n-1). During the interview, I incorrectly gave the N-time as n^2 for the incorrect code, without any acknowledgement that it was right or wrong. N-time for this problem is misleading.

Sometime after the interview, I search for the answer using Google and with some minor optimizations came up with this:

protected final byte[] base;
    
protected RecursivePairParenthesis(int pairs) {
    base = new byte[pairs * 2];
}

public void PrintPairs(int pos, int open, int close) {
    if (close == pairs) {
        System.out.println(new String(base));
    } else {
        if (open > close) {
            base[pos] = ')';
            PrintPairs(pos + 1, open, close + 1);
        }
        if (open < pairs) {
            base[pos] = '(';
            PrintPairs(pos + 1, open + 1, close);
        }
    }
}
    
public void PrintPairParenthesis() {
    for (int i = 0; i < pairs * 2; ++i) base[i] = ')';
    PrintPairs(0, 0, 0);
}
    
public static void main(String[] args) {
    new RecursivePairParenthesis(Integer.parseInt(args[0])).PrintPairParenthesis();
}

Unfortunately it uses recursion to operate, so that is still an issue. It also runs much slower for the same values of n than my original correct answer. It’s N-time is still between 2^(n-1) and 3^(n-1), but is less than my original answer. It also does not operate to millions effectively, as you would have to configure the stack to have millions of frames on it.

Next I tried to unwind the recursive nature of the recursive answer, but I ended up with very ugly code that operates without any noticeable performance increase. This took me a couple hours to do, so certainly would not have been something I would have been able to complete in the 30 minutes given for the original coding test.

Still not satisfied, I did end up spending a couple more days (yes, days) coming up with a better answer. This answer will at least theoretically handle n in the millions (more on than in a minute). This code looks like this:

int[] openPositions = new int[pairs];
final int pairs2 = pairs * 2;
byte[] sb = new byte[pairs2];
for (int i = 0; i < pairs; ++i) {
    sb[i] = '(';
    sb[i + pairs] = ')';
}
for (int i = 0; i < pairs; ++i) openPositions[i] = i;
int pos = pairs - 1;
while (pos >= 0) {
    System.out.println(new String(sb));
    pos = pairs - 1;
    sb[openPositions[pos]] = ')';
    if (++openPositions[pos] > pos * 2) {
        while (--pos >= 0) {
            sb[openPositions[pos]] = ')';
            if (++openPositions[pos] <= pos * 2) {
                sb[openPositions[pos]] = '(';
                for (int i = pos + 1, x = openPositions[pos] + 1; i < pairs; ++i, ++x) {
                    sb[x] = '(';
                    openPositions[i] = x;
                }
                break;
            }
        }
    } else {
        sb[openPositions[pos]] = '(';
    }
}

It only uses arrays with n or n * 2 elements, so memory usage is fixed. It does not use recursion, so has no stack overflow issues. It runs ever so slightly faster than the recursive or unwound answers, but still much slower than my first correct answer. It’s N-time should be slightly better than the recursive version, but still between 2^(n-1) and 3^(n-1).

The keys to this final method are realizing several things:
- There are exactly n opening parenthesis and n closing parenthesis.
- The first open parenthesis can only be in the first position and the last closing parenthesis can only be in the last position.
- The second open parenthesis can only be in the second or third position, with a corresponding pattern for the second to last closing parenthesis.
- Each answer will be exactly 2*n characters long.

With these in mind, this makes the pattern of open parenthesis look like a spring or wave:

((((....
(((.(...
(((..(..
(((...(.
((.((...
((.(.(..
((.(..(.
((..((..
((..(.(.
(.(((...
(.((.(..
(.((..(.
(.(.((..
(.(.(.(.

N in the millions is not good. At n = 64, there are more than 2^64 answers or more than 9 exa answers. With each answer at 128 bytes, that is more than 1 zettabytes of output. Given that some really big companies have exabytes of data, this one coding test problem is suppose to generate more data than these large companies currently have.

Did I pass the second time around? Maybe. But really the coding test is flawed:
1) The code has virtually no practical value: I can come up with two very obscure questionable uses for this output.
2) Giving the wrong answers as examples does not make for good code.
3) To give an answer that makes it beyond n = 17 and in the 30 minute time frame would have required knowing the question and correct code ahead of time. My first correct version takes ~3.5 seconds for n=15. My last spring version takes ~45 seconds for n=15. Both take about x3 longer for each successive n. In other words, if memory is not an issue, then use the memory version since it is 13 times faster.
4) To give the best correct code answer, one that theoretically could handle n in the millions would have taken way longer than 30 minutes unless you memorized the answer beforehand. I do not believe I could memorize it well enough to spit it back out correctly.
5) N-time is meaningless in comparing these versions. The worst N-time had the best performance by a significant margin. N-time can sometimes be a useful tool, but it has limits.
6) n in the millions is not practical for this problem with today’s computing technology, nor is it useful.
7) Writing code that you cannot run is bound to have errors.

Warning: This next part is beyond boring. The two very obscure use cases for this code that I can come up with are:
A) When populating an n-tree based database and you want to evaluate performance across all possible data organization, you could use this code as a basis for populating the data. You would do this to determine practical points at which the database tree should be rebalanced. For larger data sets, it becomes impractical to test every possible structure.
B) Used as a basis for generating formulas that have n variables and are properly formatted. You would further add coefficients, powers, and operators. It starts to break down if you want to include a variable more than once. Modern formula generation methodology uses genetics to find good formulas rather than trying every possible formula.

Ken's Random Thoughts

Monday, December 3, 2018

All Possible Matched Sets of Parenthesis

Blog Archive

About Me