MySQL Named Locks in Python Context Managers

I’ve been using MySQL (and recently MariaDB) for many years – it must be something like 14 by now – but every now and then I learn something new about it. Recently, I learned about named locks and how you can use your already-there MySQL server as a means of creating distributed locks which are not tied to a specific DB transaction.

Here is an example of a Python function whose internal code will never execute concurrently, even in a multi-process, multi-machine distributed environment, as long as all processes talk to the same MySQL database:

NOTE: this code uses SQLAlchemy-like session semantics, but can easily be applied to any Python MySQL client.
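
The listing itself did not survive in this copy of the post, so what follows is a minimal sketch of what such a function might look like, not the original code. The function name `do_something`, the lock name `my_app_lock` and the `do_work` callable are hypothetical. `session` is assumed to be any object whose `execute(sql, params)` returns a result with a `scalar()` method; with a real SQLAlchemy session you would typically wrap the SQL strings in `sqlalchemy.text()`.

```python
def do_something(session, do_work):
    """Run do_work() while holding a MySQL named lock (naive version)."""
    # GET_LOCK() returns 1 if the lock was obtained, 0 if the timeout
    # expired, and NULL (None) if an error occurred.
    acquired = session.execute(
        "SELECT GET_LOCK(:name, :timeout)",
        {"name": "my_app_lock", "timeout": 5},
    ).scalar()
    if acquired != 1:
        raise RuntimeError("Could not obtain lock 'my_app_lock' within 5 seconds")

    # Application logic goes here. It can be anything and does not
    # need to touch the database at all.
    do_work()

    # Not reached if do_work() raises, so the lock then stays held
    # until the connection is closed.
    session.execute("SELECT RELEASE_LOCK(:name)", {"name": "my_app_lock"})
```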

That’s very nice! This code will wait up to 5 seconds to obtain the lock. If the lock is obtained (meaning no other MySQL client currently holds this specific lock), execution will resume, and when it finishes the lock will be released. If the lock cannot be obtained within the given timeout, meaning some other client is currently running this code, an exception will be thrown. In any case the code will never run more than once at any given time. Oh, and MySQL named locks are connection-bound, meaning they are released if the connection dies or is explicitly closed. That should not happen while the code is executing, but it keeps us safe if the entire program crashes, for example.

Before I knew about this feature, we used to do some custom logical locking in our code (which never feels like a solid solution) or use transaction-level row / table locking, which coupled our application’s logic with DB operations too much. MySQL named locks are decoupled from any actual data in your tables – it’s just a means of getting centralized app-level locking. And while there might be other, leaner mechanisms to achieve that, if you already use MySQL I believe this is a very good solution. To clarify, the Python code between GET_LOCK and RELEASE_LOCK can be anything and does not need to tie in to the database.

However, the code example above is not very clean and has a few disadvantages:

  • It does not handle exceptions properly. If an exception is thrown after the lock was acquired and before it was released, we are most likely going to end up with the lock not being released until the MySQL connection is closed, and we don’t know when that’s going to happen. Not good.
  • No clear separation of concerns – we have a single function that handles both application logic (the part between the locks) and provides the locking implementation. This can be solved in several ways, but I believe the way I’ll demonstrate below to be most elegant.
  • No code reusability, which is somewhat tied to the previous point. We cannot reuse the locking mechanism in other code paths very easily, and need to retype it. We also cannot reuse the application logic between the locks in a non-locked context – or even in unit tests for that matter.

We can solve all these issues very elegantly using the with statement and context managers. These features are among my favorite idioms, more or less unique to the Python programming language, and it’s this sort of feature that I believe really helps make Python code very clean and elegant without being too verbose.

We’ll start by creating a context manager for MySQL named lock:

And then proceed to use it in our function:
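
Neither listing survived in this copy of the post, so here is a minimal sketch covering both steps, under the same assumptions about `session` as before (an object whose `execute(sql, params)` returns a result with `scalar()`). The `LockTimeoutError` class, the lock name and the `do_work` callable are hypothetical; `named_lock` is the name the text uses, and it would live in a module called `locking_context`.

```python
from contextlib import contextmanager


class LockTimeoutError(Exception):
    """Raised when a named lock cannot be obtained in time (hypothetical name)."""


@contextmanager
def named_lock(session, name, timeout=5):
    """Hold the MySQL named lock `name` for the duration of a with block."""
    acquired = session.execute(
        "SELECT GET_LOCK(:name, :timeout)",
        {"name": name, "timeout": timeout},
    ).scalar()
    if acquired != 1:
        raise LockTimeoutError(
            f"Could not obtain lock {name!r} within {timeout} seconds"
        )
    try:
        yield
    finally:
        # Runs on normal exit and when the with block raises.
        session.execute("SELECT RELEASE_LOCK(:name)", {"name": name})


def do_something(session, do_work):
    """Application logic only; the locking is delegated to named_lock()."""
    with named_lock(session, "my_app_lock", timeout=5):
        do_work()
```

The release sits in a `finally` clause, so it runs whether the body of the `with` block finishes normally or raises. If the lock cannot be obtained, the exception is raised before the body is entered and nothing is released.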

So what does this do? The @contextmanager decorator helps us easily create a context-managed resource using generator-like semantics; the wrapper ensures that no matter what happens, the lock is released as we leave the managed context, whether because the code executed successfully or because an exception was thrown.

The semantics of using the locking_context.named_lock context manager are extremely simple and readable, and reusing the locking context manager is a matter of an import statement and a single line of code. By injecting a mock or monkey-patched object as the first argument of named_lock(), we can also easily test the context manager itself and any code using it. In addition, if we ever need to switch from MySQL-based locking to some other implementation, it can be done more easily.

While the same flow can be achieved in many other languages supporting, for example, try / finally semantics, in most cases I’m aware of one will need to use more complex and less readable flow control structures such as callables to accomplish this (yes, if you’re a JavaScript programmer this might make sense to you, but remember that pyramids, while they look impressive from the outside, are really tombs with mummies and traps on the inside). I believe it is features like context managers that make Python a language that encourages writing clean code.

Monitoring EC2 instance memory usage with CloudWatch

At Shoppimon we’ve been relying a lot on Amazon infrastructure – it may not be the most cost-effective option for larger, more stable companies, but for small start-ups that need to be very dynamic, can’t have high up-front costs and don’t have a large IT department, it’s a great choice. We do try to keep our code clean from any vendor-specific APIs, but when it comes to infrastructure & operations management, AWS (with help from tools like Chef) has been great for us.

One of the AWS tools we use is CloudWatch – it allows us to monitor our infrastructure and get alerted when things go wrong. While it’s not the most flexible monitoring tool out there, it takes care of most of what we need right now and has the advantage of not needing to run an additional server and configure tools such as Nagios or Cacti. With its custom metrics feature, we can even send some app-level data and monitor it using the common Amazon set of tools.

However, there’s one big missing feature in CloudWatch: it doesn’t monitor your instance memory utilization. I suppose Amazon has all sorts of technical reasons not to provide this very important metric out of the box (probably related to the fact that their monitoring is done from outside the instance VM), but really if you need to monitor servers, in addition to CPU load and IO, memory utilization is one of the most important metrics to be aware of.
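
As a rough illustration of how the custom metrics feature can fill that gap, here is a minimal sketch that computes memory utilization from inside a Linux instance. It is not the approach described in the full post, which is not included here. The function names, the `Custom/System` namespace and the `MemoryUtilization` metric name are made up for the example. It assumes a kernel that reports `MemAvailable` in `/proc/meminfo`, and the optional `publish()` helper assumes the third-party `boto3` package and configured AWS credentials.

```python
def memory_utilization_percent(meminfo_text):
    """Return used memory as a percentage, from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total


def publish(percent, namespace="Custom/System"):
    """Send the value to CloudWatch as a custom metric (needs boto3)."""
    import boto3  # third-party; imported lazily so the parser works without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=[
            {"MetricName": "MemoryUtilization", "Value": percent, "Unit": "Percent"}
        ],
    )
```

On an instance this could be run periodically, for example from cron, by reading `/proc/meminfo`, passing its contents to `memory_utilization_percent()` and handing the result to `publish()`.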


Generators in PHP 5.5

Now that PHP 5.5 alpha versions are being released, I decided to grab the latest PHP source from GitHub, build it and give the new Generators feature a spin. I have used generators in the past in Python, and was excited to hear they are coming to PHP. While they are useful mostly in advanced use cases, they can make a lot of simple use cases much more efficient, and I think they are a handy addition to the advanced PHP programmer’s toolbox.

What are Generators?

I like to describe Generators as special functions which are iterable and maintain state. Think of a function that, instead of returning once and destroying its state (local variables) after returning, can return multiple times while maintaining the state of its local variables, thus allowing iteration over an instance of that function state. In fact, a call to a generator function creates a special Generator object which can be iterated. The object maintains the internal state of the generator, and on each iteration generates a new value. The same result can be achieved by writing a class that implements the Iterator interface (Traversable itself cannot be implemented directly), but a generator needs much less code.

This is very different from the way we are used to thinking about functions, so maybe an example is the best way to demonstrate this. I will use a simplified example based on the one given in the documentation:


function xrange($start, $end, $step = 1)
{
  for ($i = $start; $i <= $end; $i += $step) {
    yield $i;
  }
}

$start = microtime(true);
foreach (xrange(0, 1000000) as $i) {
  // do nothing
}
$end = microtime(true);

echo "Total time: " . ($end - $start) . " sec\n";
echo "Peak memory usage: " . memory_get_peak_usage() . " bytes\n";

In the example above, the xrange function is a Generator which works as a simplified version of PHP’s range() function (just like in Python!). The main thing to notice is the yield keyword – this tells the function to yield a value – which means a value is “returned” but the state of the generator is maintained.

When iterating over a generator function, as you can see in the foreach loop, iteration continues as long as a value is yielded. Once the function returns without yielding (as xrange in our example would do once the inner for loop is done), iteration stops. We get a behaviour which is (almost) equivalent to range in the sense that it allows us to iterate over numbers – but, without allocating the entire array of numbers in advance. In our example, we save a lot of memory and in fact execution is faster when a generator is used.

To demonstrate, here is the output of the script above (ok, I added some formatting to the output, but the results are real!):

$ /usr/local/bin/php /tmp/with-generators.php
Total time: 0.20149302482605 sec
Peak memory usage: 234,256 bytes

This is on a one-million integers “array” (unlike range, no real array is allocated so we can’t do random access on members, but during iteration it behaves just like an array).

By comparison, executing the same code with range() instead of xrange() results in the following:

$ /usr/local/bin/php /tmp/without-generators.php
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 32 bytes) in /private/tmp/generators.php on line 12

Ok, we reach our memory limit. Let’s try to go crazy (not a good idea in production):

$ /usr/local/bin/php -d memory_limit=200M /tmp/without-generators.php
Total time: 0.31754398345947 sec
Peak memory usage: 144,617,256 bytes

After increasing the memory limit to 200 MB, the script runs, but it takes more than 50% longer (honestly, to my surprise) and consumes over 600 times as much memory.

Pretty cool, huh?

Just to demonstrate, calling var_dump on a generator would result in this:


var_dump(xrange(0, 100));
// Output:
// object(Generator)#2 (0) {
// }
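
Since the comparison with Python came up, here is roughly the same generator written in Python, as a small sketch of the lazy behaviour described above. It is only an illustration and is not taken from the PHP documentation example. Python’s built-in `range` excludes the end value, so the inclusive `xrange` below mirrors the PHP function instead.

```python
def xrange(start, end, step=1):
    """Yield start, start + step, ... up to and including end."""
    i = start
    while i <= end:
        yield i  # execution pauses here and resumes on the next iteration
        i += step


numbers = xrange(0, 1000000)  # creates a generator object; nothing runs yet
first_three = [next(numbers) for _ in range(3)]  # [0, 1, 2]
total = sum(xrange(0, 10))  # 55; the values are consumed one at a time
```

As in PHP, calling the function only creates a generator object. Values are produced one by one as the consumer asks for them, so no million-element list is ever built.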

But I can do the same thing with Iterator interfaces, no?

Yes! Pretty much anything you can do with Generators can be done by creating a class which implements either the Iterator or IteratorAggregate interface. But in many cases, a lot of boilerplate code can be removed if a Generator is used instead. For example, a class equivalent to the xrange generator above would look like this:


class XrangeObject implements Iterator
{
  private $value = 0;
  private $start = 0;
  private $end   = 0;
  private $step  = 1;

  public function __construct($start, $end, $step = 1)
  {
    $this->value = (int) $start;
    $this->start = (int) $start;
    $this->end   = (int) $end;
    $this->step  = (int) $step;
  }

  public function rewind()
  {
    $this->value = $this->start;
  }

  public function current()
  {
    return $this->value;
  }

  public function key()
  {
    return $this->value;
  }

  public function next()
  {
    return ($this->value += $this->step);
  }

  public function valid()
  {
    return $this->value <= $this->end;
  }
}

$start = microtime(true);
$xrange = new XrangeObject(0, 1000000);
foreach ($xrange as $i) {
  // do nothing
}
$end = microtime(true);

echo "Total time: " . ($end - $start) . " sec\n";
echo "Peak memory usage: " . memory_get_peak_usage() . " bytes\n";

Wow, that’s much more code for something we achieved very simply with a generator. BTW, the results are:


$ /usr/local/bin/php /tmp/with-iterator.php
Total time: 0.61971187591553 sec
Peak memory usage: 240,968 bytes

As you can see, memory usage is comparable to a Generator. Run time is about 3 times longer, but in most realistic use cases this difference is negligible – in any case, unless we see an order of magnitude of difference, performance is not a major issue here. The interesting thing really is the amount of boilerplate code we had to write when creating an iterator – most of this code is just generic boring stuff and not what we really care about. With Generators, the implementation is much shorter.

How about a realistic use case?

Ok, so we have used a generator to iterate over numbers. Woopti-doo. We can just drop the generator and use the for loop inside it to achieve the same thing. How about a more realistic use case?

Take a look at the following example, which I believe can be pretty useful and still has fairly straightforward code: a generator which combines the efficiency of XMLReader with the simple API of SimpleXML to bring you an efficient yet easy to use XML reader function for possibly large XML streams with repeating structure – for example, RSS or Atom feeds.


function xml_stream_reader($url, $element)
{
  $reader = new XMLReader();
  $reader->open($url);

  while (true) {
    // Skip to next element
    while (! ($reader->nodeType == XMLReader::ELEMENT && $reader->name == $element)) {
      if (! $reader->read()) break 2;
    }

    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == $element) {
      yield simplexml_load_string($reader->readOuterXml());
      $reader->next();
    }
  }
}

The xml_stream_reader() generator defined above will use XMLReader to open and read from an XML stream. Unlike PHP’s SimpleXML or DOM extensions, it will not read an entire XML document into memory, thus avoiding potential blowups on very large XML files. To keep things simple for the user however, whenever it encounters the XML element searched by the user (e.g. the item element in RSS feeds), it will read the entire element into memory (assume each item is small but there are potentially thousands of items) and return it as a SimpleXMLElement object – thus still providing the ease of use of SimpleXML for the consumer.

Here is how it can be used:


$feed = xml_stream_reader('http://news.google.com/?output=rss&num=100', 'item');
foreach($feed as $itemXml) {
  echo $itemXml->title . "\n";
}

While I couldn’t find a large-enough XML file to test this on, even with 2 MB files this can be much more efficient than DOM or SimpleXML, without requiring much more code.
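
For comparison, the same streaming idea can be sketched in Python with the standard library’s `xml.etree.ElementTree.iterparse`. This is an illustrative analogue under stated assumptions, not a port of the PHP function above: the names are reused only for readability, `source` is assumed to be a filename or a binary file object, and namespaced tags would have to be matched as `{namespace}tag`.

```python
import io
import xml.etree.ElementTree as ET


def xml_stream_reader(source, element):
    """Yield each <element> subtree from source as it finishes parsing."""
    for _event, node in ET.iterparse(source, events=("end",)):
        if node.tag == element:
            yield node
            # Once the consumer asks for the next item, drop this one's
            # children and text. The emptied element itself stays attached
            # to its parent, so memory still grows slowly on huge inputs.
            node.clear()


feed = io.BytesIO(
    b"<rss><channel>"
    b"<item><title>First</title></item>"
    b"<item><title>Second</title></item>"
    b"</channel></rss>"
)
titles = [item.findtext("title") for item in xml_stream_reader(feed, "item")]
```

Each item is handed to the consumer as a small element tree, much like the SimpleXMLElement objects in the PHP version, while the document as a whole is parsed incrementally.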

So I’m really happy about the addition of generators – it’s a cool feature. Not one you’d use every day, but in some places where complex Iterators had to be implemented (and where OO features such as polymorphism are not required), generators can be a really neat, concise and maintainable solution.