en_GB@blog

Open Source and Other Things

The “word break” problem

From Retiring a Great Interview Problem (via Retiring a Great Interview Problem), this is the best I could get in less than 30 minutes:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					continue
				return [w,] + partial
		return [None, ]

	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

if __name__=="__main__":
	dictionary = [ 'pasty', 'app', 'apple', 'pi', 'pie', ]
	print segment_string('applepie', dictionary)

My solution is recursive with backtracking and works with n splits (general case). I can’t make a proper big O analysis though. In fact I’m not completely sure the solution is correct, but you know that developers are nothing if not compulsive problem solvers, so I had to try.

Update (updated): this is a iterative solution using backtracking and it’s supposed to be equivalent to the recursive code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	while string != '':
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

When I was trying to improve the performance of my solution I thought I found something interesting, but I was wrong (the iterative code had a bug). Now I’ll try to improve worst cases on the iterative code (if I don’t find another bug first!).

Update 2 (memoization): finally I got the iterative version to work with memoization, which seems to be a cool word to say caching.

Adding memoization to the recursive version of the code resulted to be very easy:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		if string in cache.keys():
			return cache[string]
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					cache[string] = [None, ]
					continue
				result = [w,] + partial
				cache[string] = result
				return result
		return [None, ]

	cache = {}
	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

Basically we cache all the partial results, and when we find a string we already know, we can get that result from the cache. Because the function is recursive, the new functionality was easy to add.

Well, I had some problems to get the iterative version right because I tried to cache all results at first (goods and dead ends), but then I realized that was useless because we want to speed up the worst cases and that means the dead ends only in the decision tree.

So that’s what I’m doing in the following code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	cache = {}
	while string != '':
		if string in cache.keys():
			if not len(back):
				return None
			else:
				solution.pop()
				string, start = back.pop()
				continue
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			cache[string] = None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

The iterative solution looks more complicated (and it is), although is equivalent for worst cases:

LEN = 100
dictionary = ['a'*i for i in range(1, LEN+1)]
string = 'a'*(LEN-1) + 'b'

t = timeit.Timer("print 'result:', segment_string(string, dictionary)", "from __main__ import segment_string, string, dictionary")
print "recursive", t.timeit(1)

t = timeit.Timer("print 'result:', segment_string_iter(string, dictionary)", "from __main__ import segment_string_iter, string, dictionary")
print "iterative", t.timeit(1)

That results in the following output:

recursive result: None
0.0203249454498
iterative result: None
0.0391519069672

That’s a huge improvement over my first version!

So, why would you like to stick with the iterative version? Because of the stack. What happens if we use LEN = 1000 in the above test?

RuntimeError: maximum recursion depth exceeded in cmp

That’s for the recursive version, it crashes. The iterative version needs 28 seconds to finish, but still works.

And that’s all. It’s been a nice exercise for my holidays!

Problems with Google Reader

I stopped using Liferea after years and years of almost complete satisfaction, mostly because the SQLite storage it’s a little bit clumsy and I started to get annoyed by the HDD making noise every RSS update (and the UI freezes, and it uses lots of CPU, etc).

So I decided to give Google Reader a try, and it’s not even close to Liferea’s experience. Actually it looks to me like it’s a good interface to scan through the information instead of a good interface for reading.

The web interface it’s not as fast and responsive as the native interface, and it even forced me to change the way I read my feeds, to the point that I’ve found that I check for new content less often and it seems to be very difficult not having about 500 unread entries (that can happen with Liferea too, but it was in the 100 magnitude).

The interesting side effect is that I’m doing other things, which I’m not sure that are more productive, but Google Reader it’s definitely not useful to get yourself up to date with your feeds. May be it’s better that way.

Memset on OpenStack first anniversary

MemSet 1 Year Birthday OSCON from OpenStack on Vimeo.

It could be better, but it could be worse too! Anyway I think we made a couple of interesting points, didn’t we?

Open Cloud Initiative Launched!

Today at the OSCON11 the Open Cloud Initiative (OCI) was launched as a non-porofit organization to advocate standards-based Open Cloud Computing.

From the official announcement:

The Open Cloud Initiative (OCI) has launched its official website at http://www.opencloudinitiative.org/ and commenced a 30-day final comment period on the Open Cloud Principles (OCP), which are designed to ensure user freedoms without impeding the ability of providers to do business. They are focused on interoperability, avoiding barriers to entry or exit, ensuring technological neutrality and forbidding discrimination. They define the specific requirements for Open Standards and mandate their use for formats and interfaces, calling for “multiple full, faithful and interoperable implementations”, at least one of which being Open Source.

This is a big step forward in the direction that the Franklin Street Statement pointed out three years ago. Check current status of the Open Cloud Principles (OCP), and then look for similarities with the Franklin Street Statement.

This is really interesting and I’m looking forward to the development of the initiative.

Hello world!

Hey! Look out! I have a new home.

I’ve been blogging for two years and a half in Ramble On, and finally I felt like moving from Tumblr to a real blogging platform.

The old posts will stay there, mainly because Tumblr doesn’t provide any way for exporting the data. I knew that when I signed up, but you know when something is meant to be an experiment and… two years later you have a brand new blog at wordpress.com.

I’ll see you around.

Follow

Get every new post delivered to your Inbox.