From Retiring a Great Interview Problem (via Retiring a Great Interview Problem), this is the best I could get in less than 30 minutes:
def segment_string(string, dictionary):
def split(string, dictionary):
if string == '':
return []
for w in dictionary:
if string.startswith(w):
partial = split(string[len(w):], dictionary)
if None in partial:
continue
return [w,] + partial
return [None, ]
result = split(string, dictionary)
if None in result:
return None
else:
return ' '.join(result)
if __name__=="__main__":
dictionary = [ 'pasty', 'app', 'apple', 'pi', 'pie', ]
print segment_string('applepie', dictionary)
My solution is recursive with backtracking and works with n splits (general case). I can’t make a proper big O analysis though. In fact I’m not completely sure the solution is correct, but you know that developers are nothing if not compulsive problem solvers, so I had to try.
Update (updated): this is a iterative solution using backtracking and it’s supposed to be equivalent to the recursive code:
def segment_string_iter(string, dictionary):
solution = []
back = []
start = 0
while string != '':
for i in range(start, len(dictionary)):
w = dictionary[i]
if string.startswith(w):
solution.append(w)
back.append([string, i+1])
string = string[len(w):]
start = 0
break
else:
if not len(back):
return None
solution.pop()
string, start = back.pop()
return ' '.join(solution)
When I was trying to improve the performance of my solution I thought I found something interesting, but I was wrong (the iterative code had a bug). Now I’ll try to improve worst cases on the iterative code (if I don’t find another bug first!).
Update 2 (memoization): finally I got the iterative version to work with memoization, which seems to be a cool word to say caching.
Adding memoization to the recursive version of the code resulted to be very easy:
def segment_string(string, dictionary):
def split(string, dictionary):
if string == '':
return []
if string in cache.keys():
return cache[string]
for w in dictionary:
if string.startswith(w):
partial = split(string[len(w):], dictionary)
if None in partial:
cache[string] = [None, ]
continue
result = [w,] + partial
cache[string] = result
return result
return [None, ]
cache = {}
result = split(string, dictionary)
if None in result:
return None
else:
return ' '.join(result)
Basically we cache all the partial results, and when we find a string we already know, we can get that result from the cache. Because the function is recursive, the new functionality was easy to add.
Well, I had some problems to get the iterative version right because I tried to cache all results at first (goods and dead ends), but then I realized that was useless because we want to speed up the worst cases and that means the dead ends only in the decision tree.
So that’s what I’m doing in the following code:
def segment_string_iter(string, dictionary):
solution = []
back = []
start = 0
cache = {}
while string != '':
if string in cache.keys():
if not len(back):
return None
else:
solution.pop()
string, start = back.pop()
continue
for i in range(start, len(dictionary)):
w = dictionary[i]
if string.startswith(w):
solution.append(w)
back.append([string, i+1])
string = string[len(w):]
start = 0
break
else:
if not len(back):
return None
cache[string] = None
solution.pop()
string, start = back.pop()
return ' '.join(solution)
The iterative solution looks more complicated (and it is), although is equivalent for worst cases:
LEN = 100
dictionary = ['a'*i for i in range(1, LEN+1)]
string = 'a'*(LEN-1) + 'b'
t = timeit.Timer("print 'result:', segment_string(string, dictionary)", "from __main__ import segment_string, string, dictionary")
print "recursive", t.timeit(1)
t = timeit.Timer("print 'result:', segment_string_iter(string, dictionary)", "from __main__ import segment_string_iter, string, dictionary")
print "iterative", t.timeit(1)
That results in the following output:
recursive result: None
0.0203249454498
iterative result: None
0.0391519069672
That’s a huge improvement over my first version!
So, why would you like to stick with the iterative version? Because of the stack. What happens if we use LEN = 1000 in the above test?
RuntimeError: maximum recursion depth exceeded in cmp
That’s for the recursive version, it crashes. The iterative version needs 28 seconds to finish, but still works.
And that’s all. It’s been a nice exercise for my holidays!
Like this:
Like Loading...
Looks correct to me. :-)
At this point I think you’re entitled to read the original post, which covers the analysis of this solution’s performance and suggests how to improve it.
Thanks for your comment! I’ll read your post again to see how I can improve the performance of my solution.
Nice problem – thanks for sharing, and yes I had fun solving it ;-)
If you want to improve your solution then get rid of the linear search of the dictionary…
I make the running time of the above roughly O(len(dictionary) * 2^n) in the worst case.
Nick: glad you like it :)
Actually I tried to improve worst cases, but I had a problem with recursion.
Try this input:
Play with LEN value, my implementation is really inefficient (30 minutes limit implementation, but anyway!).