Rev 7: * Added ``index.init_index`` to create new indices. (Robert Collins) in http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk

Robert Collins robertc at robertcollins.net
Sun Jun 8 12:45:04 BST 2008


At http://people.ubuntu.com/~robertc/baz2.0/plugins/search/trunk

------------------------------------------------------------
revno: 7
revision-id: robertc at robertcollins.net-20080608114120-2gjfj34t8xuqp2qt
parent: robertc at robertcollins.net-20080608083710-cd001ee9lrgicdu8
committer: Robert Collins <robertc at robertcollins.net>
branch nick: trunk
timestamp: Sun 2008-06-08 21:41:20 +1000
message:
  * Added ``index.init_index`` to create new indices. (Robert Collins)
  * Added ``Index.index_revisions`` to request indexing of revisions.
    (Robert Collins)
  * Added ``Index.indexed_revisions`` to report on indexed revisions.
    (Robert Collins)
added:
  DESIGN                         design-20080608072426-vjoj110dtykfyb7g-1
modified:
  NEWS                           news-20080608052041-z5bahsl8kwl0uf4x-2
  errors.py                      errors.py-20080608055509-hnimeek7q8tctkqf-1
  index.py                       index.py-20080608055509-hnimeek7q8tctkqf-2
  tests/test_errors.py           test_errors.py-20080608055509-hnimeek7q8tctkqf-3
  tests/test_index.py            test_index.py-20080608055509-hnimeek7q8tctkqf-4
=== added file 'DESIGN'
--- a/DESIGN	1970-01-01 00:00:00 +0000
+++ b/DESIGN	2008-06-08 11:41:20 +0000
@@ -0,0 +1,107 @@
+This document contains notes about the design of bzr search.
+
+Naming documents
+++++++++++++++++
+
+I plan to use a single namespace for all documents: revisions, inventories and
+file texts, branch tags etc.
+
+What to index
++++++++++++++
+
+I think we can sensibly index revisions, tree-shapes, and file texts. One very
+interesting question is how to index deltas - for instance should searching for
+'foo bar' return only the document-versions where 'foo bar' was introduced? Or
+those where it was merged (as bzr annotate does), or any document-version where
+'foo bar' is present at all ? In the short term, I'm punting on a real answer
+to this because it requires serious thought, and I want to deliver something
+for people to play with, so I'm going to simply annotate every text, and take
+only the terms from lines ascribed to version being indexed. This will require
+some care to avoid either being pathologically slow, or blowing out memory on
+large trees.
+
+Indexing non-text
+=================
+
+I think it would be good to have a filtering capability so that e.g. openoffice
+files can generate term-lists. This relates to the 'index deltas' question in
+that external filters are likely n
+
+How to index
+++++++++++++
+
+To allow incremental index operations, indices should be committed to regularly
+during the index process. To allow cheap 'what is not indexed' queries, we can
+use revision ids as a 'indexed' flag in the index. So probably within <batch
+size> we should index by file_id (for annotation efficiency) and across <batch
+size> we effectively index topologically (though I think partial indices are a 
+really good idea to allow automatic-updates).
+
+Where to store indices
+++++++++++++++++++++++
+
+There are many possible places to store indices. Ideally they get deleted when
+a branch is deleted, can be shared between branches, update automatically etc.
+For now, I'm going to store them in .bzr/bzr-search when the bzrdir is a
+MetaDir, and just refuse to index other branches. To index an svn branch, I
+would expect to use a look-aside table and store the index in e.g.
+~/.bazaar/search/<encoded_url>. Storing in bzr-search is a bit of an
+abstraction violation, but we have a split-out structure for a reason, and by
+prefixing it with bzr-search I leave 'search' available for a 'core' version of
+this feature.
+
+Search engine to use
+++++++++++++++++++++
+
+I looked at: pylucence (java dependency, hard to integrate into our VCS),
+xapwrap (abandonware), lupy(abandonware). So in the interested of expendiency,
+I'm writing a trivial boolean term index for this project.
+
+
+Disk Format
++++++++++++
+
+I've modelled this after a pack repository.
+
+So we have a lock dir to control commits.
+
+A names file listing the indices.
+
+First cut
+=========
+
+This is designed to get the system up and running with least effort, not to
+scale.
+
+Each index is then GraphIndex(1, 3) containing:
+('x', 'x', term) -> posting list
+('x', 'r', revid) -> () # indicates revid is indexed.
+
+Note that this contrains terms to be valid index element keys, but we can
+always encode during the index operation if needed.
+
+For scaling we need to handle extremely long posting lists.
+
+Second cut
+==========
+
+Each name in the names file is a pack, its length, and two read vectors for the
+terms and revisions lists. The pack contains a series of posting lists trailed
+by a top level index of terms and indexed revisions.
+
+Posting list
+------------
+
+A GraphIndex(0, 3)
+
+(type, file_id, versioned_id) -> nothing
+
+Terms index
+-----------
+
+A GraphIndex(0, 1)
+(term,) -> start, end of the terms posting list in the pack.
+
+Revisions index
+---------------
+(revision,) -> nothing

=== modified file 'NEWS'
--- a/NEWS	2008-06-08 06:42:53 +0000
+++ b/NEWS	2008-06-08 11:41:20 +0000
@@ -13,7 +13,7 @@
 
   FEATURES:
 
-    * New command ``index`` to create a search index.
+    * New command ``index`` to create a search index for a branch.
       (Robert Collins)
 
     * New command ``search`` to search a search index from within bzr.
@@ -33,6 +33,14 @@
       functions for obtaining an index and creating/updating an index.
       (Robert Collins)
 
+    * Added ``index.init_index`` to create new indices. (Robert Collins)
+
+    * Added ``Index.index_revisions`` to request indexing of revisions.
+      (Robert Collins)
+
+    * Added ``Index.indexed_revisions`` to report on indexed revisions.
+      (Robert Collins)
+
     * New modules: ``commands``, ``errors``, ``index``. These contain the
       console ui, exceptions, and the search index core respectively.
       (Robert Collins)

=== modified file 'errors.py'
--- a/errors.py	2008-06-08 05:57:24 +0000
+++ b/errors.py	2008-06-08 11:41:20 +0000
@@ -22,6 +22,15 @@
 import bzrlib.errors
 
 
+class CannotIndex(BzrError):
+    """Raised when a particular control dir class is unrecognised."""
+
+    _fmt = "Cannot index %(thing)r, it is not a known control dir type."
+
+    def __init__(self, thing):
+        self.thing = thing
+
+
 class NoSearchIndex(BzrError):
     """Raised when there is no search index for a url."""
 

=== modified file 'index.py'
--- a/index.py	2008-06-08 06:54:53 +0000
+++ b/index.py	2008-06-08 11:41:20 +0000
@@ -18,17 +18,48 @@
 """The core logic for search."""
 
 from bzrlib import branch as _mod_branch
+from bzrlib.bzrdir import BzrDirMeta1
+from bzrlib.errors import NotBranchError, NoSuchFile, UnknownFormatError
+from bzrlib.index import CombinedGraphIndex, InMemoryGraphIndex
+from bzrlib.lockdir import LockDir
 from bzrlib.plugins.search import errors
 
 
+_FORMAT_1 = 'bzr-search search folder 1\n'
+
+
+def init_index(branch):
+    """Initialise an index on branch."""
+    if not isinstance(branch.bzrdir, BzrDirMeta1):
+        # We don't know how to handle this format.
+        raise errors.CannotIndex(branch)
+    transport = branch.bzrdir.transport
+    transport.mkdir('bzr-search')
+    index_transport = transport.clone('bzr-search')
+    lockdir = LockDir(index_transport, 'names-lock')
+    lockdir.create()
+    lockdir.lock_write()
+    try:
+        index_transport.put_bytes('format', _FORMAT_1)
+        names_list = InMemoryGraphIndex(1, 3)
+        index_transport.put_file('names', names_list.finish())
+    finally:
+        lockdir.unlock()
+    return open_index_url(branch.bzrdir.root_transport.base)
+
+
 def index_url(url):
     """Create or update an index at url.
 
     :param url: The url to index.
     :return: The resulting search index.
     """
-    branch = _mod_branch.Branch.open(url)
-    return Index()
+    try:
+        index = open_index_url(url)
+    except errors.NoSearchIndex:
+        branch = _mod_branch.Branch.open(url)
+        index = init_index(branch)
+    return index
 
 
 def open_index_url(url):
@@ -38,8 +69,49 @@
     :return: A search index.
     :raises: NoSearchIndex if no index can be located.
     """
-    raise errors.NoSearchIndex(url)
+    try:
+        branch = _mod_branch.Branch.open(url)
+    except NotBranchError:
+        raise errors.NoSearchIndex(url)
+    return Index(branch.bzrdir.transport.clone('bzr-search'))
 
 
 class Index(object):
     """A bzr content index."""
+
+    def __init__(self, index_transport):
+        """Create an index stored at index_transport.
+
+        :param index_transport: The path where the index data should be stored.
+        """
+        self.transport = index_transport
+        try:
+            format = self.transport.get_bytes('format')
+        except NoSuchFile:
+            raise errors.NoSearchIndex(index_transport)
+        if _FORMAT_1 != format:
+            raise UnknownFormatError(format, 'bzr-search')
+        self._indices = []
+        self._index = CombinedGraphIndex(self._indices)
+
+    def index_revisions(self, branch, revisions_to_index):
+        """Index some revisions from branch.
+        
+        :param branch: A branch to index.
+        :param revisions_to_index: A set of revision ids to index.
+        """
+        # TODO: split into groups of <reasonable memory size>
+        # here: index texts
+        # here: index inventory/paths
+        # here: index revision
+        index = InMemoryGraphIndex(1, 3)
+        for rev_id in revisions_to_index:
+            index.add_node(('x', 'r', rev_id), '', ((),))
+        self._indices.append(index)
+        # write to disc.
+        # pack etc.
+
+    def indexed_revisions(self):
+        """Return the revision_keys that this index contains terms for."""
+        for node in self._index.iter_entries_prefix([('x', 'r', None)]):
+            yield node[1][2:3]

=== modified file 'tests/test_errors.py'
--- a/tests/test_errors.py	2008-06-08 05:57:24 +0000
+++ b/tests/test_errors.py	2008-06-08 11:41:20 +0000
@@ -23,6 +23,12 @@
 
 class TestErrors(TestCaseWithTransport):
 
+    def test_cannot_index(self):
+        error = errors.CannotIndex('a branch')
+        self.assertEqualDiff(
+            "Cannot index 'a branch', it is not a known control dir type.",
+            str(error))
+    
     def test_no_index_error(self):
         error = errors.NoSearchIndex('a url')
         self.assertEqualDiff(

=== modified file 'tests/test_index.py'
--- a/tests/test_index.py	2008-06-08 06:54:53 +0000
+++ b/tests/test_index.py	2008-06-08 11:41:20 +0000
@@ -17,18 +17,51 @@
 
 """Tests for the index layer."""
 
-from bzrlib.errors import NotBranchError
+from bzrlib.errors import NotBranchError, UnknownFormatError
+from bzrlib.index import GraphIndex
 from bzrlib.plugins.search import errors, index
 from bzrlib.tests import TestCaseWithTransport
 
 
 class TestIndex(TestCaseWithTransport):
 
+    def test_init_index(self):
+        branch = self.make_branch('foo')
+        search_index = index.init_index(branch)
+        # We should have some basic files on disk, and a valid index returned.
+        self.assertIsInstance(search_index, index.Index)
+        transport = self.get_transport('foo/.bzr/bzr-search')
+        # We expect two files:
+        # - format, containing 'bzr-search search folder 1\n'
+        # - a names file, which is an empty GraphIndex
+        self.assertEqual('bzr-search search folder 1\n',
+            transport.get_bytes('format'))
+        names_list = GraphIndex(transport, 'names', None)
+        self.assertEqual([], list(names_list.iter_all_entries()))
+
+    def test_init_index_unindexable(self):
+        # any non-metadir will do here:
+        branch = self.make_branch('foo', format='weave')
+        self.assertRaises(errors.CannotIndex, index.init_index, branch)
+
     def test_open_no_index_error(self):
         err = self.assertRaises(errors.NoSearchIndex, index.open_index_url,
             self.get_url())
         self.assertEqual(self.get_url(), err.url)
 
+    def test_open_index_wrong_format_errors(self):
+        branch = self.make_branch('foo', format='pack-0.92')
+        search_index = index.init_index(branch)
+        transport = self.get_transport('foo/.bzr/bzr-search')
+        transport.put_bytes('format', 'garbage\n')
+        self.assertRaises(UnknownFormatError, index.Index, transport)
+
+    def test_open_index_missing_format_raises_NoSearchIndex(self):
+        branch = self.make_branch('foo', format='pack-0.92')
+        transport = self.get_transport('foo/.bzr/bzr-search')
+        transport.mkdir('.')
+        self.assertRaises(errors.NoSearchIndex, index.Index, transport)
+
     def test_index_url_not_branch(self):
         self.assertRaises(NotBranchError, index.index_url,
             self.get_url())
@@ -37,3 +70,14 @@
         branch = self.make_branch('foo')
         search_index = index.index_url(self.get_url('foo'))
         self.assertIsInstance(search_index, index.Index)
+
+
+class TestIndexRevisions(TestCaseWithTransport):
+    """Tests for indexing of a set of revisions."""
+
+    def test_empty_one_revision(self):
+        tree = self.make_branch_and_tree('')
+        rev_index = index.init_index(tree.branch)
+        revid = tree.commit('first post')
+        rev_index.index_revisions(tree.branch, [revid])
+        self.assertEqual(set([(revid,)]), set(rev_index.indexed_revisions()))




More information about the bazaar-commits mailing list