en_GB@blog

Open Source and Other Things

Tag Archives: Python

PyWeek 13

I’ve registered into next PyWeek 13 as a solo entry (registration is still open, ping me if you want to join!), although it starts the same Sunday the Barcamp Apache Oxford 11 ends.

I’ll try to finish something, using a couple of hours after work every day, and more intensively on the weekend. The theme will be voted tomorrow, but anyway I’m thinking in different strategies and possible game mechanics (besides reading some pygame docs).

I’m planning writing some posts on my entry, but that will be after the contest starts 00:00 UTC 2011-09-11.

New versions of ftp-cloudfs and sftp-cloudfs

Last week we (the dev team of Memset) released a new version of different open source projects which objective is to provide a convenient file system-alike interface for OpenStack Object Storage (codename Swift).

As you may know, Swift is a object storage and not a file system or real-time data storage system. To explain it short: you can store, retrieve and delete objects over a RESTful HTTP API, but not all the operations that you would expect on a real file system are available.

Swift works very well as a long-term storage or when you need to store static data, and its API is compatible with Rackspace Cloud Files, so you have several applications that can communicate with it out of the box (or even with explicit support, for example with Cyberduck and OpenStack Object Storage protocol).

When we started working in a product based on Swift (to be released soon, stay tuned!), we found that the pseudo-hierarchical directories support on Swift has lots of possibilities, and… wouldn’t it be great to use any of your favourite file transfer applications directly on Swift?

Nick Craig-Wood (Memset’s Technical Director) started looking for any open source doing that, and we found the project of Chmouel Boudjnah: ftp-cloudfs. We were more interested in a SFTP interface, but Chomuel approach was very nice (MIT licensed), so we decided to put some work on his code, writing a fake file system interface layer to make some specific applications work over FTP.

After some love (other people call this QA), ftp-cloudfs has reached version 0.9, and we believe it’s in good shape to be used in a real-world™ environment.

At the same time we started to work in a SFTP implementation of the same idea, using Paramiko to deal with the protocol details, and thanks to reusing some code from ftp-cloudfs, we promptly released sftp-cloudfs as open source.

We’re planning to refactorize the code and move out the file system layer to a different project (right now sftp-cloudfs depends on ftp-cloudfs, which is not optimal), but as today I think we have reached an important milestone in both projects.

With this release we also added the packages to PyPi to make as easy as possible to start playing with the daemons, including our most crazy project to provide a full file system interface through FUSE: pycloudfuse.

So if you want to give them a go you can download the code, or try with pip:

$ pip install ftp-cloudfs

and for the SFTP code:

$ pip install sftp-cloudfs

Note: pycloudfuse mostly works, but it’s not in the same good shape than the other two projects and it needs more love (again, talking about QA here). Any help is welcome!

The “word break” problem

From Retiring a Great Interview Problem (via Retiring a Great Interview Problem), this is the best I could get in less than 30 minutes:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					continue
				return [w,] + partial
		return [None, ]

	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

if __name__=="__main__":
	dictionary = [ 'pasty', 'app', 'apple', 'pi', 'pie', ]
	print segment_string('applepie', dictionary)

My solution is recursive with backtracking and works with n splits (general case). I can’t make a proper big O analysis though. In fact I’m not completely sure the solution is correct, but you know that developers are nothing if not compulsive problem solvers, so I had to try.

Update (updated): this is a iterative solution using backtracking and it’s supposed to be equivalent to the recursive code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	while string != '':
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

When I was trying to improve the performance of my solution I thought I found something interesting, but I was wrong (the iterative code had a bug). Now I’ll try to improve worst cases on the iterative code (if I don’t find another bug first!).

Update 2 (memoization): finally I got the iterative version to work with memoization, which seems to be a cool word to say caching.

Adding memoization to the recursive version of the code resulted to be very easy:

def segment_string(string, dictionary):
	def split(string, dictionary):
		if string == '':
			return []
		if string in cache.keys():
			return cache[string]
		for w in dictionary:
			if string.startswith(w):
				partial = split(string[len(w):], dictionary)
				if None in partial:
					cache[string] = [None, ]
					continue
				result = [w,] + partial
				cache[string] = result
				return result
		return [None, ]

	cache = {}
	result = split(string, dictionary)
	if None in result:
		return None
	else:
		return ' '.join(result)

Basically we cache all the partial results, and when we find a string we already know, we can get that result from the cache. Because the function is recursive, the new functionality was easy to add.

Well, I had some problems to get the iterative version right because I tried to cache all results at first (goods and dead ends), but then I realized that was useless because we want to speed up the worst cases and that means the dead ends only in the decision tree.

So that’s what I’m doing in the following code:

def segment_string_iter(string, dictionary):
	solution = []
	back = []
	start = 0
	cache = {}
	while string != '':
		if string in cache.keys():
			if not len(back):
				return None
			else:
				solution.pop()
				string, start = back.pop()
				continue
		for i in range(start, len(dictionary)):
			w = dictionary[i]
			if string.startswith(w):
				solution.append(w)
				back.append([string, i+1])
				string = string[len(w):]
				start = 0
				break
		else:
			if not len(back):
				return None
			cache[string] = None
			solution.pop()
			string, start = back.pop()
	return ' '.join(solution)

The iterative solution looks more complicated (and it is), although is equivalent for worst cases:

LEN = 100
dictionary = ['a'*i for i in range(1, LEN+1)]
string = 'a'*(LEN-1) + 'b'

t = timeit.Timer("print 'result:', segment_string(string, dictionary)", "from __main__ import segment_string, string, dictionary")
print "recursive", t.timeit(1)

t = timeit.Timer("print 'result:', segment_string_iter(string, dictionary)", "from __main__ import segment_string_iter, string, dictionary")
print "iterative", t.timeit(1)

That results in the following output:

recursive result: None
0.0203249454498
iterative result: None
0.0391519069672

That’s a huge improvement over my first version!

So, why would you like to stick with the iterative version? Because of the stack. What happens if we use LEN = 1000 in the above test?

RuntimeError: maximum recursion depth exceeded in cmp

That’s for the recursive version, it crashes. The iterative version needs 28 seconds to finish, but still works.

And that’s all. It’s been a nice exercise for my holidays!

Follow

Get every new post delivered to your Inbox.