Rev 4197: (jam) Improvements to LRUCache structure, use a double-linked-list in file:///home/pqm/archives/thelove/bzr/%2Btrunk/
Canonical.com Patch Queue Manager
pqm at pqm.ubuntu.com
Tue Mar 24 17:01:55 GMT 2009
At file:///home/pqm/archives/thelove/bzr/%2Btrunk/
------------------------------------------------------------
revno: 4197
revision-id: pqm at pqm.ubuntu.com-20090324170150-9wtdpv5w7192zdwy
parent: pqm at pqm.ubuntu.com-20090324134038-bfrsas8eleeoiv0a
parent: john at arbash-meinel.com-20090324142828-69pxzmujy9mxp7ge
committer: Canonical.com Patch Queue Manager <pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Tue 2009-03-24 17:01:50 +0000
message:
(jam) Improvements to LRUCache structure, use a double-linked-list
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
bzrlib/tests/test_lru_cache.py test_lru_cache.py-20070119165535-hph6rk4h9rzy4180-1
------------------------------------------------------------
revno: 4178.3.7
revision-id: john at arbash-meinel.com-20090324142828-69pxzmujy9mxp7ge
parent: john at arbash-meinel.com-20090323022806-snn93cbemn82w1b0
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Tue 2009-03-24 09:28:28 -0500
message:
Review tweaks from Ian.
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
bzrlib/tests/test_lru_cache.py test_lru_cache.py-20070119165535-hph6rk4h9rzy4180-1
------------------------------------------------------------
revno: 4178.3.6
revision-id: john at arbash-meinel.com-20090323022806-snn93cbemn82w1b0
parent: john at arbash-meinel.com-20090322193045-te9xf2ufvsnq09fl
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Sun 2009-03-22 21:28:06 -0500
message:
Remove the asserts, and change some to AssertionError.
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
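[Editor's aside: a minimal sketch of why this change matters, not bzrlib code.
Python strips `assert` statements when run with -O, so invariant checks that
must always fire are better written as explicit raises.]

def check_invariant(condition, message):
    # 'assert condition, message' would be compiled away under python -O;
    # an explicit raise keeps the check alive in optimized runs, which is
    # the rationale for this revision's conversion.
    if not condition:
        raise AssertionError(message)

check_invariant(1 + 1 == 2, 'arithmetic is broken')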
------------------------------------------------------------
revno: 4178.3.5
revision-id: john at arbash-meinel.com-20090322193045-te9xf2ufvsnq09fl
parent: john at arbash-meinel.com-20090322191330-bbueodlalhf0onzh
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Sun 2009-03-22 14:30:45 -0500
message:
Add tests that LRUCache.get() properly tracks accesses.
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
bzrlib/tests/test_lru_cache.py test_lru_cache.py-20070119165535-hph6rk4h9rzy4180-1
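[Editor's aside: the behavior under test, sketched against the public API
shown in the diff below; assumes a Python 2 environment where bzrlib is
importable. after_cleanup_count is passed explicitly so exactly one entry
is evicted.]

from bzrlib import lru_cache

cache = lru_cache.LRUCache(max_cache=2, after_cleanup_count=2)
cache[1] = 10
cache[2] = 20
cache.get(1)    # get() must promote key 1, just like __getitem__
cache[3] = 30   # over max_cache, so cleanup evicts the true LRU: key 2
assert 1 in cache and 2 not in cache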
------------------------------------------------------------
revno: 4178.3.4
revision-id: john at arbash-meinel.com-20090322191330-bbueodlalhf0onzh
parent: john at arbash-meinel.com-20090322190503-wabx7fjrgarxx9c5
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Sun 2009-03-22 14:13:30 -0500
message:
Shave off approx 100ms by inlining _record_access into __getitem__.
Also, play some tricks with member access so that we don't pay double
lookup costs when we can avoid it.
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
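[Editor's aside: the "tricks with member access" amount to hoisting
attribute reads out of hot methods. A self-contained sketch of the effect
being exploited, with hypothetical names, not bzrlib code:]

import timeit

class Hot(object):
    def __init__(self):
        self.attr = 1

    def repeated_lookup(self):
        total = 0
        for _ in range(1000):
            total += self.attr   # pays an attribute lookup per iteration
        return total

    def bound_local(self):
        attr = self.attr         # one lookup, then cheap local reads
        total = 0
        for _ in range(1000):
            total += attr
        return total

h = Hot()
print(timeit.timeit(h.repeated_lookup, number=1000))
print(timeit.timeit(h.bound_local, number=1000))   # measurably faster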
------------------------------------------------------------
revno: 4178.3.3
revision-id: john at arbash-meinel.com-20090322190503-wabx7fjrgarxx9c5
parent: john at arbash-meinel.com-20090322170709-r389wogofbe8n6d9
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Sun 2009-03-22 14:05:03 -0500
message:
LRUCache is now implemented with a dict to a linked list,
rather than multiple dicts.
This turns out to be rather good for memory savings, and even a little
bit faster. Proper data structures FTW.
Still seeing ~500ms overhead versus a FIFO or dict, but some of that
is just the overhead you expect from having to maintain the LRU info.
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
bzrlib/tests/test_lru_cache.py test_lru_cache.py-20070119165535-hph6rk4h9rzy4180-1
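[Editor's aside: for readers skimming the diff below, a stripped-down
standalone sketch of the structure described in this message. The real
class also handles cleanup callbacks, sizing, and batched eviction.]

class _Node(object):
    """One cache entry; prev/next links form the LRU ordering."""
    __slots__ = ('prev', 'next', 'key', 'value')

    def __init__(self, key, value):
        self.prev = None
        self.next = None
        self.key = key
        self.value = value


class TinyLRU(object):

    def __init__(self, max_cache):
        self._max_cache = max_cache
        self._cache = {}   # the single dict: key -> _Node
        self._mru = None   # head of the linked list
        self._lru = None   # tail of the linked list

    def _unlink(self, node):
        if node is self._mru:
            self._mru = node.next
        if node is self._lru:
            self._lru = node.prev
        if node.prev is not None:
            node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        node.prev = node.next = None

    def _push_front(self, node):
        node.next = self._mru
        if self._mru is not None:
            self._mru.prev = node
        self._mru = node
        if self._lru is None:
            self._lru = node

    def __getitem__(self, key):
        node = self._cache[key]
        if node is not self._mru:  # recording access is a few pointer swaps
            self._unlink(node)
            self._push_front(node)
        return node.value

    def __setitem__(self, key, value):
        node = self._cache.get(key)
        if node is None:
            node = _Node(key, value)
            self._cache[key] = node
        else:
            node.value = value
            self._unlink(node)
        self._push_front(node)
        if len(self._cache) > self._max_cache:
            victim = self._lru     # evicting the LRU is O(1) from the tail
            self._unlink(victim)
            del self._cache[victim.key]

cache = TinyLRU(max_cache=2)
cache['a'] = 1
cache['b'] = 2
cache['a']          # touching 'a' makes 'b' the eviction candidate
cache['c'] = 3      # over capacity: 'b' is dropped, 'a' survives
assert 'b' not in cache._cache and 'a' in cache._cache

The old deque-based scheme appended a key on every access and periodically
compacted the queue; here every access is a constant number of pointer
updates and no per-key refcount dict is needed, which is where both the
memory savings and the speedup come from.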
------------------------------------------------------------
revno: 4178.3.2
revision-id: john at arbash-meinel.com-20090322170709-r389wogofbe8n6d9
parent: john at arbash-meinel.com-20090322170012-wmdv58057nxe4s9u
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Sun 2009-03-22 12:07:09 -0500
message:
Add tests for LRUCache.cache_size()
modified:
bzrlib/tests/test_lru_cache.py test_lru_cache.py-20070119165535-hph6rk4h9rzy4180-1
------------------------------------------------------------
revno: 4178.3.1
revision-id: john at arbash-meinel.com-20090322170012-wmdv58057nxe4s9u
parent: pqm at pqm.ubuntu.com-20090320192036-455rjm03qqnr818d
committer: John Arbash Meinel <john at arbash-meinel.com>
branch nick: lru_cache_linked_lst
timestamp: Sun 2009-03-22 12:00:12 -0500
message:
Implement LRUCache.cache_size(), so that it can trivially substitute for FIFOCache.
modified:
bzrlib/lru_cache.py lru_cache.py-20070119165515-tlw203kuwh0id5gv-1
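[Editor's aside: a short usage sketch, assuming a working bzrlib install;
the FIFOCache parity claim is taken from the message above. The calls shown
are confirmed by the tests in the diff below.]

from bzrlib import lru_cache

cache = lru_cache.LRUCache(max_cache=256)
# Answering the same question FIFOCache answers lets callers that only
# ask "how big are you configured to be?" take either cache type.
assert cache.cache_size() == 256
cache.resize(512)
assert cache.cache_size() == 512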
=== modified file 'bzrlib/lru_cache.py'
--- a/bzrlib/lru_cache.py 2009-03-23 14:59:43 +0000
+++ b/bzrlib/lru_cache.py 2009-03-24 17:01:50 +0000
@@ -16,9 +16,46 @@
"""A simple least-recently-used (LRU) cache."""
-from collections import deque
-
-from bzrlib import symbol_versioning
+from bzrlib import (
+ symbol_versioning,
+ trace,
+ )
+
+
+class _LRUNode(object):
+ """This maintains the linked-list which is the lru internals."""
+
+ __slots__ = ('prev', 'next', 'key', 'value', 'cleanup', 'size')
+
+ def __init__(self, key, value, cleanup=None):
+ self.prev = None
+ self.next = None
+ self.key = key
+ self.value = value
+ self.cleanup = cleanup
+ # TODO: We could compute this 'on-the-fly' like we used to, and remove
+ # one pointer from this object, we just need to decide if it
+ # actually costs us much of anything in normal usage
+ self.size = None
+
+ def __repr__(self):
+ if self.next is None:
+ next = None
+ else:
+ next = self.next.key
+ if self.prev is None:
+ prev = None
+ else:
+ prev = self.prev.key
+ return '%s(%r n:%r p:%r)' % (self.__class__.__name__, self.key,
+ next, prev)
+
+ def run_cleanup(self):
+ if self.cleanup is not None:
+ self.cleanup(self.key, self.value)
+ self.cleanup = None
+ # Just make sure to break any refcycles, etc
+ self.value = None
class LRUCache(object):
@@ -33,48 +70,108 @@
DeprecationWarning)
after_cleanup_count = after_cleanup_size
self._cache = {}
- self._cleanup = {}
- self._queue = deque() # Track when things are accessed
- self._refcount = {} # number of entries in self._queue for each key
+ # The "HEAD" of the lru linked list
+ self._most_recently_used = None
+ # The "TAIL" of the lru linked list
+ self._last_recently_used = None
self._update_max_cache(max_cache, after_cleanup_count)
def __contains__(self, key):
return key in self._cache
def __getitem__(self, key):
- val = self._cache[key]
- self._record_access(key)
- return val
+ node = self._cache[key]
+ # Inlined from _record_access to decrease the overhead of __getitem__
+ # We also have more knowledge about structure if __getitem__ is
+ # succeeding, then we know that self._most_recently_used must not be
+ # None, etc.
+ mru = self._most_recently_used
+ if node is mru:
+ # Nothing to do, this node is already at the head of the queue
+ return node.value
+ elif node is self._last_recently_used:
+ self._last_recently_used = node.prev
+ # Remove this node from the old location
+ node_prev = node.prev
+ node_next = node.next
+ if node_prev is not None:
+ node_prev.next = node_next
+ if node_next is not None:
+ node_next.prev = node_prev
+ # Insert this node to the front of the list
+ node.next = mru
+ mru.prev = node
+ self._most_recently_used = node
+ node.prev = None
+ return node.value
def __len__(self):
return len(self._cache)
+ def _walk_lru(self):
+ """Walk the LRU list, only meant to be used in tests."""
+ node = self._most_recently_used
+ if node is not None:
+ if node.prev is not None:
+ raise AssertionError('the _most_recently_used entry is not'
+ ' supposed to have a previous entry'
+ ' %s' % (node,))
+ while node is not None:
+ if node.next is None:
+ if node is not self._last_recently_used:
+ raise AssertionError('only the last node should have'
+ ' no next value: %s' % (node,))
+ else:
+ if node.next.prev is not node:
+ raise AssertionError('inconsistency found, node.next.prev'
+ ' != node: %s' % (node,))
+ if node.prev is None:
+ if node is not self._most_recently_used:
+ raise AssertionError('only the _most_recently_used should'
+ ' not have a previous node: %s'
+ % (node,))
+ else:
+ if node.prev.next is not node:
+ raise AssertionError('inconsistency found, node.prev.next'
+ ' != node: %s' % (node,))
+ yield node
+ node = node.next
+
def add(self, key, value, cleanup=None):
"""Add a new value to the cache.
- Also, if the entry is ever removed from the queue, call cleanup.
- Passing it the key and value being removed.
+ Also, if the entry is ever removed from the cache, call
+ cleanup(key, value).
:param key: The key to store it under
:param value: The object to store
:param cleanup: None or a function taking (key, value) to indicate
- 'value' sohuld be cleaned up.
+ 'value' should be cleaned up.
"""
if key in self._cache:
- self._remove(key)
- self._cache[key] = value
- if cleanup is not None:
- self._cleanup[key] = cleanup
- self._record_access(key)
+ node = self._cache[key]
+ node.run_cleanup()
+ node.value = value
+ node.cleanup = cleanup
+ else:
+ node = _LRUNode(key, value, cleanup=cleanup)
+ self._cache[key] = node
+ self._record_access(node)
if len(self._cache) > self._max_cache:
# Trigger the cleanup
self.cleanup()
+ def cache_size(self):
+ """Get the number of entries we will cache."""
+ return self._max_cache
+
def get(self, key, default=None):
- if key in self._cache:
- return self[key]
- return default
+ node = self._cache.get(key, None)
+ if node is None:
+ return default
+ self._record_access(node)
+ return node.value
def keys(self):
"""Get the list of keys currently cached.
@@ -87,6 +184,10 @@
"""
return self._cache.keys()
+ def items(self):
+ """Get the key:value pairs as a dict."""
+ return dict((k, n.value) for k, n in self._cache.iteritems())
+
def cleanup(self):
"""Clear the cache until it shrinks to the requested size.
@@ -96,45 +197,44 @@
# Make sure the cache is shrunk to the correct size
while len(self._cache) > self._after_cleanup_count:
self._remove_lru()
- # No need to compact the queue at this point, because the code that
- # calls this would have already triggered it based on queue length
def __setitem__(self, key, value):
"""Add a value to the cache, there will be no cleanup function."""
self.add(key, value, cleanup=None)
- def _record_access(self, key):
+ def _record_access(self, node):
"""Record that key was accessed."""
- self._queue.append(key)
- # Can't use setdefault because you can't += 1 the result
- self._refcount[key] = self._refcount.get(key, 0) + 1
-
- # If our access queue is too large, clean it up too
- if len(self._queue) > self._compact_queue_length:
- self._compact_queue()
-
- def _compact_queue(self):
- """Compact the queue, leaving things in sorted last appended order."""
- new_queue = deque()
- for item in self._queue:
- if self._refcount[item] == 1:
- new_queue.append(item)
- else:
- self._refcount[item] -= 1
- self._queue = new_queue
- # All entries should be of the same size. There should be one entry in
- # queue for each entry in cache, and all refcounts should == 1
- if not (len(self._queue) == len(self._cache) ==
- len(self._refcount) == sum(self._refcount.itervalues())):
- raise AssertionError()
-
- def _remove(self, key):
- """Remove an entry, making sure to maintain the invariants."""
- cleanup = self._cleanup.pop(key, None)
- val = self._cache.pop(key)
- if cleanup is not None:
- cleanup(key, val)
- return val
+ # Move 'node' to the front of the queue
+ if self._most_recently_used is None:
+ self._most_recently_used = node
+ self._last_recently_used = node
+ return
+ elif node is self._most_recently_used:
+ # Nothing to do, this node is already at the head of the queue
+ return
+ elif node is self._last_recently_used:
+ self._last_recently_used = node.prev
+ # We've taken care of the tail pointer, remove the node, and insert it
+ # at the front
+ # REMOVE
+ if node.prev is not None:
+ node.prev.next = node.next
+ if node.next is not None:
+ node.next.prev = node.prev
+ # INSERT
+ node.next = self._most_recently_used
+ self._most_recently_used = node
+ node.next.prev = node
+ node.prev = None
+
+ def _remove_node(self, node):
+ if node is self._last_recently_used:
+ self._last_recently_used = node.prev
+ self._cache.pop(node.key)
+ # If we have removed all entries, remove the head pointer as well
+ if self._last_recently_used is None:
+ self._most_recently_used = None
+ node.run_cleanup()
def _remove_lru(self):
"""Remove one entry from the lru, and handle consequences.
@@ -142,11 +242,7 @@
If there are no more references to the lru, then this entry should be
removed from the cache.
"""
- key = self._queue.popleft()
- self._refcount[key] -= 1
- if not self._refcount[key]:
- del self._refcount[key]
- self._remove(key)
+ self._remove_node(self._last_recently_used)
def clear(self):
"""Clear out all of the cache."""
@@ -164,11 +260,8 @@
if after_cleanup_count is None:
self._after_cleanup_count = self._max_cache * 8 / 10
else:
- self._after_cleanup_count = min(after_cleanup_count, self._max_cache)
-
- self._compact_queue_length = 4*self._max_cache
- if len(self._queue) > self._compact_queue_length:
- self._compact_queue()
+ self._after_cleanup_count = min(after_cleanup_count,
+ self._max_cache)
self.cleanup()
@@ -178,7 +271,8 @@
This differs in that it doesn't care how many actual items there are,
it just restricts the cache to be cleaned up after so much data is stored.
- The values that are added must support len(value).
+ The size of items added will be computed using compute_size(value), which
+ defaults to len() if not supplied.
"""
def __init__(self, max_size=1024*1024, after_cleanup_size=None,
@@ -200,33 +294,43 @@
self._compute_size = compute_size
if compute_size is None:
self._compute_size = len
- # This approximates that texts are > 0.5k in size. It only really
- # effects when we clean up the queue, so we don't want it to be too
- # large.
self._update_max_size(max_size, after_cleanup_size=after_cleanup_size)
LRUCache.__init__(self, max_cache=max(int(max_size/512), 1))
def add(self, key, value, cleanup=None):
"""Add a new value to the cache.
- Also, if the entry is ever removed from the queue, call cleanup.
- Passing it the key and value being removed.
+ Also, if the entry is ever removed from the cache, call
+ cleanup(key, value).
:param key: The key to store it under
:param value: The object to store
:param cleanup: None or a function taking (key, value) to indicate
- 'value' sohuld be cleaned up.
+ 'value' should be cleaned up.
"""
- if key in self._cache:
- self._remove(key)
+ node = self._cache.get(key, None)
value_len = self._compute_size(value)
if value_len >= self._after_cleanup_size:
+ # The new value is 'too big to fit', as it would fill up/overflow
+ # the cache all by itself
+ trace.mutter('Adding the key %r to an LRUSizeCache failed.'
+ ' value %d is too big to fit in the cache'
+ ' with size %d %d', key, value_len,
+ self._after_cleanup_size, self._max_size)
+ if node is not None:
+ # We won't be replacing the old node, so just remove it
+ self._remove_node(node)
+ if cleanup is not None:
+ cleanup(key, value)
return
+ if node is None:
+ node = _LRUNode(key, value, cleanup=cleanup)
+ self._cache[key] = node
+ else:
+ self._value_size -= node.size
+ node.size = value_len
self._value_size += value_len
- self._cache[key] = value
- if cleanup is not None:
- self._cleanup[key] = cleanup
- self._record_access(key)
+ self._record_access(node)
if self._value_size > self._max_size:
# Time to cleanup
@@ -242,10 +346,9 @@
while self._value_size > self._after_cleanup_size:
self._remove_lru()
- def _remove(self, key):
- """Remove an entry, making sure to maintain the invariants."""
- val = LRUCache._remove(self, key)
- self._value_size -= self._compute_size(val)
+ def _remove_node(self, node):
+ self._value_size -= node.size
+ LRUCache._remove_node(self, node)
def resize(self, max_size, after_cleanup_size=None):
"""Change the number of bytes that will be cached."""
=== modified file 'bzrlib/tests/test_lru_cache.py'
--- a/bzrlib/tests/test_lru_cache.py 2009-03-23 14:59:43 +0000
+++ b/bzrlib/tests/test_lru_cache.py 2009-03-24 17:01:50 +0000
@@ -25,6 +25,16 @@
class TestLRUCache(tests.TestCase):
"""Test that LRU cache properly keeps track of entries."""
+ def test_cache_size(self):
+ cache = lru_cache.LRUCache(max_cache=10)
+ self.assertEqual(10, cache.cache_size())
+
+ cache = lru_cache.LRUCache(max_cache=256)
+ self.assertEqual(256, cache.cache_size())
+
+ cache.resize(512)
+ self.assertEqual(512, cache.cache_size())
+
def test_missing(self):
cache = lru_cache.LRUCache(max_cache=10)
@@ -63,18 +73,6 @@
self.failIf('foo' in cache)
- def test_queue_stays_bounded(self):
- """Lots of accesses does not cause the queue to grow without bound."""
- cache = lru_cache.LRUCache(max_cache=10)
-
- cache['baz'] = 'biz'
- cache['foo'] = 'bar'
-
- for i in xrange(1000):
- cache['baz']
-
- self.failUnless(len(cache._queue) < 40)
-
def test_cleanup(self):
"""Test that we can use a cleanup function."""
cleanup_called = []
@@ -92,7 +90,8 @@
# 'foo' is now most recent, so final cleanup will call it last
cache['foo']
cache.clear()
- self.assertEqual([('baz', '1'), ('biz', '3'), ('foo', '2')], cleanup_called)
+ self.assertEqual([('baz', '1'), ('biz', '3'), ('foo', '2')],
+ cleanup_called)
def test_cleanup_on_replace(self):
"""Replacing an object should cleanup the old value."""
@@ -179,7 +178,7 @@
cache.cleanup()
self.assertEqual(2, len(cache))
- def test_compact_preserves_last_access_order(self):
+ def test_preserve_last_access_order(self):
cache = lru_cache.LRUCache(max_cache=5)
# Add these in order
@@ -189,20 +188,14 @@
cache.add(4, 30)
cache.add(5, 35)
- self.assertEqual([1, 2, 3, 4, 5], list(cache._queue))
+ self.assertEqual([5, 4, 3, 2, 1], [n.key for n in cache._walk_lru()])
# Now access some randomly
cache[2]
cache[5]
cache[3]
cache[2]
- self.assertEqual([1, 2, 3, 4, 5, 2, 5, 3, 2], list(cache._queue))
- self.assertEqual({1:1, 2:3, 3:2, 4:1, 5:2}, cache._refcount)
-
- # Compacting should save the last position
- cache._compact_queue()
- self.assertEqual([1, 4, 5, 3, 2], list(cache._queue))
- self.assertEqual({1:1, 2:1, 3:1, 4:1, 5:1}, cache._refcount)
+ self.assertEqual([2, 3, 5, 4, 1], [n.key for n in cache._walk_lru()])
def test_get(self):
cache = lru_cache.LRUCache(max_cache=5)
@@ -213,6 +206,9 @@
self.assertIs(None, cache.get(3))
obj = object()
self.assertIs(obj, cache.get(3, obj))
+ self.assertEqual([2, 1], [n.key for n in cache._walk_lru()])
+ self.assertEqual(10, cache.get(1))
+ self.assertEqual([1, 2], [n.key for n in cache._walk_lru()])
def test_keys(self):
cache = lru_cache.LRUCache(max_cache=5, after_cleanup_count=5)
@@ -278,7 +274,6 @@
def test_basic_init(self):
cache = lru_cache.LRUSizeCache()
self.assertEqual(2048, cache._max_cache)
- self.assertEqual(4*2048, cache._compact_queue_length)
self.assertEqual(int(cache._max_size*0.8), cache._after_cleanup_size)
self.assertEqual(0, cache._value_size)
@@ -293,29 +288,46 @@
self.assertEqual(0, cache._value_size)
cache.add('my key', 'my value text')
self.assertEqual(13, cache._value_size)
- cache._remove('my key')
+ node = cache._cache['my key']
+ cache._remove_node(node)
self.assertEqual(0, cache._value_size)
def test_no_add_over_size(self):
"""Adding a large value may not be cached at all."""
cache = lru_cache.LRUSizeCache(max_size=10, after_cleanup_size=5)
self.assertEqual(0, cache._value_size)
- self.assertEqual({}, cache._cache)
+ self.assertEqual({}, cache.items())
cache.add('test', 'key')
self.assertEqual(3, cache._value_size)
- self.assertEqual({'test':'key'}, cache._cache)
+ self.assertEqual({'test': 'key'}, cache.items())
cache.add('test2', 'key that is too big')
self.assertEqual(3, cache._value_size)
- self.assertEqual({'test':'key'}, cache._cache)
+ self.assertEqual({'test':'key'}, cache.items())
# If we would add a key, only to cleanup and remove all cached entries,
# then obviously that value should not be stored
cache.add('test3', 'bigkey')
self.assertEqual(3, cache._value_size)
- self.assertEqual({'test':'key'}, cache._cache)
+ self.assertEqual({'test':'key'}, cache.items())
cache.add('test4', 'bikey')
self.assertEqual(3, cache._value_size)
- self.assertEqual({'test':'key'}, cache._cache)
+ self.assertEqual({'test':'key'}, cache.items())
+
+ def test_no_add_over_size_cleanup(self):
+ """If a large value is not cached, we will call cleanup right away."""
+ cleanup_calls = []
+ def cleanup(key, value):
+ cleanup_calls.append((key, value))
+
+ cache = lru_cache.LRUSizeCache(max_size=10, after_cleanup_size=5)
+ self.assertEqual(0, cache._value_size)
+ self.assertEqual({}, cache.items())
+ cache.add('test', 'key that is too big', cleanup=cleanup)
+ # key was not added
+ self.assertEqual(0, cache._value_size)
+ self.assertEqual({}, cache.items())
+ # and cleanup was called
+ self.assertEqual([('test', 'key that is too big')], cleanup_calls)
def test_adding_clears_cache_based_on_size(self):
"""The cache is cleared in LRU order until small enough"""
@@ -329,7 +341,7 @@
# We have to remove 2 keys to get back under limit
self.assertEqual(6+8, cache._value_size)
self.assertEqual({'key2':'value2', 'key4':'value234'},
- cache._cache)
+ cache.items())
def test_adding_clears_to_after_cleanup_size(self):
cache = lru_cache.LRUSizeCache(max_size=20, after_cleanup_size=10)
@@ -341,7 +353,7 @@
cache.add('key4', 'value234') # 8 chars, over limit
# We have to remove 3 keys to get back under limit
self.assertEqual(8, cache._value_size)
- self.assertEqual({'key4':'value234'}, cache._cache)
+ self.assertEqual({'key4':'value234'}, cache.items())
def test_custom_sizes(self):
def size_of_list(lst):
@@ -357,7 +369,7 @@
cache.add('key4', ['value', '234']) # 8 chars, over limit
# We have to remove 3 keys to get back under limit
self.assertEqual(8, cache._value_size)
- self.assertEqual({'key4':['value', '234']}, cache._cache)
+ self.assertEqual({'key4':['value', '234']}, cache.items())
def test_cleanup(self):
cache = lru_cache.LRUSizeCache(max_size=20, after_cleanup_size=10)