en_GB@blog

Open Source and Other Things

Tag Archives: problem

The “word break” problem

From Retiring a Great Interview Problem (via Retiring a Great Interview Problem), this is the best I could get in less than 30 minutes:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					continue
				return [w,] + partial
		return [None, ]

	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

if __name__=="__main__":
	dictionary = [ 'pasty', 'app', 'apple', 'pi', 'pie', ]
	print segment_string('applepie', dictionary)

My solution is recursive with backtracking and works with n splits (general case). I can’t make a proper big O analysis though. In fact I’m not completely sure the solution is correct, but you know that developers are nothing if not compulsive problem solvers, so I had to try.

Update (updated): this is a iterative solution using backtracking and it’s supposed to be equivalent to the recursive code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	while string != '':
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

When I was trying to improve the performance of my solution I thought I found something interesting, but I was wrong (the iterative code had a bug). Now I’ll try to improve worst cases on the iterative code (if I don’t find another bug first!).

Update 2 (memoization): finally I got the iterative version to work with memoization, which seems to be a cool word to say caching.

Adding memoization to the recursive version of the code resulted to be very easy:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		if string in cache.keys():
			return cache[string]
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					cache[string] = [None, ]
					continue
				result = [w,] + partial
				cache[string] = result
				return result
		return [None, ]

	cache = {}
	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

Basically we cache all the partial results, and when we find a string we already know, we can get that result from the cache. Because the function is recursive, the new functionality was easy to add.

Well, I had some problems to get the iterative version right because I tried to cache all results at first (goods and dead ends), but then I realized that was useless because we want to speed up the worst cases and that means the dead ends only in the decision tree.

So that’s what I’m doing in the following code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	cache = {}
	while string != '':
		if string in cache.keys():
			if not len(back):
				return None
			else:
				solution.pop()
				string, start = back.pop()
				continue
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			cache[string] = None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

The iterative solution looks more complicated (and it is), although is equivalent for worst cases:

LEN = 100
dictionary = ['a'*i for i in range(1, LEN+1)]
string = 'a'*(LEN-1) + 'b'

t = timeit.Timer("print 'result:', segment_string(string, dictionary)", "from __main__ import segment_string, string, dictionary")
print "recursive", t.timeit(1)

t = timeit.Timer("print 'result:', segment_string_iter(string, dictionary)", "from __main__ import segment_string_iter, string, dictionary")
print "iterative", t.timeit(1)

That results in the following output:

recursive result: None
0.0203249454498
iterative result: None
0.0391519069672

That’s a huge improvement over my first version!

So, why would you like to stick with the iterative version? Because of the stack. What happens if we use LEN = 1000 in the above test?

RuntimeError: maximum recursion depth exceeded in cmp

That’s for the recursive version, it crashes. The iterative version needs 28 seconds to finish, but still works.

And that’s all. It’s been a nice exercise for my holidays!

Follow

Get every new post delivered to your Inbox.