en_GB@blog

Open Source and Other Things

Category Archives: Programming

PyWeek 13

I’ve registered into next PyWeek 13 as a solo entry (registration is still open, ping me if you want to join!), although it starts the same Sunday the Barcamp Apache Oxford 11 ends.

I’ll try to finish something, using a couple of hours after work every day, and more intensively on the weekend. The theme will be voted tomorrow, but anyway I’m thinking in different strategies and possible game mechanics (besides reading some pygame docs).

I’m planning writing some posts on my entry, but that will be after the contest starts 00:00 UTC 2011-09-11.

Reasons to hate Javascript (2)

Javascript arrays are probably reason enough to keep yourself away from the language, but anyway: sort compares strings by default and ordering an array of numbers will result in something unexpected, unless you provide your own comparison function:

// open firebug console to see the output
var arry = [ 42, 13, 6, 9, 12, 58, 1 ];

// by default it treats the numbers as strings
console.log(arry.sort()) // [1, 12, 13, 42, 58, 6, 9]

// providing a comparison function
console.log(arry.sort(function (a,b) {
        if (a>b) {
                return 1;
        } else {
                if (a<b) {
                        return -1;      
                } else {
                        return 0;       
                }
        }
        // or return a-b
})); // [1, 6, 9, 12, 13, 42, 58]

I don’t know if we could include this in the Javascript arrays are a non-sense box, but it’s definitely an odd behaviour.

Reasons to hate Javascript

The length property in an array:

// check the output on firebug console
var arry = [];

console.log(arry.length); // 0

arry[100] = 'one element';
console.log(arry.length); // 101

for(idx in arry) {
	console.log(arry[idx]);
}
// one element

Yes, the array has only one element and length is showing the largest integer property name in the array, plus one.

The “word break” problem

From Retiring a Great Interview Problem (via Retiring a Great Interview Problem), this is the best I could get in less than 30 minutes:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					continue
				return [w,] + partial
		return [None, ]

	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

if __name__=="__main__":
	dictionary = [ 'pasty', 'app', 'apple', 'pi', 'pie', ]
	print segment_string('applepie', dictionary)

My solution is recursive with backtracking and works with n splits (general case). I can’t make a proper big O analysis though. In fact I’m not completely sure the solution is correct, but you know that developers are nothing if not compulsive problem solvers, so I had to try.

Update (updated): this is a iterative solution using backtracking and it’s supposed to be equivalent to the recursive code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	while string != '':
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

When I was trying to improve the performance of my solution I thought I found something interesting, but I was wrong (the iterative code had a bug). Now I’ll try to improve worst cases on the iterative code (if I don’t find another bug first!).

Update 2 (memoization): finally I got the iterative version to work with memoization, which seems to be a cool word to say caching.

Adding memoization to the recursive version of the code resulted to be very easy:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		if string in cache.keys():
			return cache[string]
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					cache[string] = [None, ]
					continue
				result = [w,] + partial
				cache[string] = result
				return result
		return [None, ]

	cache = {}
	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

Basically we cache all the partial results, and when we find a string we already know, we can get that result from the cache. Because the function is recursive, the new functionality was easy to add.

Well, I had some problems to get the iterative version right because I tried to cache all results at first (goods and dead ends), but then I realized that was useless because we want to speed up the worst cases and that means the dead ends only in the decision tree.

So that’s what I’m doing in the following code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	cache = {}
	while string != '':
		if string in cache.keys():
			if not len(back):
				return None
			else:
				solution.pop()
				string, start = back.pop()
				continue
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			cache[string] = None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

The iterative solution looks more complicated (and it is), although is equivalent for worst cases:

LEN = 100
dictionary = ['a'*i for i in range(1, LEN+1)]
string = 'a'*(LEN-1) + 'b'

t = timeit.Timer("print 'result:', segment_string(string, dictionary)", "from __main__ import segment_string, string, dictionary")
print "recursive", t.timeit(1)

t = timeit.Timer("print 'result:', segment_string_iter(string, dictionary)", "from __main__ import segment_string_iter, string, dictionary")
print "iterative", t.timeit(1)

That results in the following output:

recursive result: None
0.0203249454498
iterative result: None
0.0391519069672

That’s a huge improvement over my first version!

So, why would you like to stick with the iterative version? Because of the stack. What happens if we use LEN = 1000 in the above test?

RuntimeError: maximum recursion depth exceeded in cmp

That’s for the recursive version, it crashes. The iterative version needs 28 seconds to finish, but still works.

And that’s all. It’s been a nice exercise for my holidays!

Follow

Get every new post delivered to your Inbox.