==================
    TODO: nabu
==================

.. contents::
..
    1  Misc
    2  Server Setup
    3  Directives
    4  Extractors
      4.1  Definitions
      4.2  Questions
    5  Jeff's Suggestions

Misc
====

- nabu/docutils: Create a web page on nabu for docutils proposals for
  establishing conventions for microformats; use my initial Nabu stuff and
  send an email to the mailing-list asking for feedback. Eventually we want to
  move some of Nabu's functionality directly into docutils.

- I want, like a wiki, to be able to serve documents from a local checkout.
  IMHO this whole thing would be simpler if the file came out of a Subversion
  repository and were accessible as such, e.g. references to local documents
  from the file would be translated to filenames from the Subversion
  repository. I want to do this::

     .. image:: myfigure.png

  in a directory::

     /path/to/blog/losangeles/entry.txt
     /path/to/blog/losangeles/myfigure.png

  and for all the paths to be found automatically. This means that Nabu would
  have to become tied to a Subversion checkout. I think that might even be
  reasonable at some point, if it allows me to access the files. I don't know.

- Implement a swish-e index on the database and link it up to walus.

- Check out links:

  - stikkit:
  - m[uo]ndial: db of place-names in the world
  - “The Predictors”, book, see ref.

- Interrupting the nabu client should be handled.

- Add a feature to directly email a brand-new peephole from the website (just
  enter the email addresses and send it from there).

- Add a field for some kinds of entries that allows us to store the order
  within a document in which the entry occurs. We should be able, for example,
  to extract the first five entries showing up in a source file.

- Produce a photo for my main website page, a series of photo-booth
  photographs stacked next to each other, vertically, for the main page.
- Birthday fields do not seem to extract properly, for the most part.

- Make my addrbook table available for my mobile phone.

- You **REALLY** need to support multiple profiles for running nabu manually.
  It's really annoying to delete the entire server database by accident.

- We really need to detect a syntax for “Question”. e.g. Q:

- We need to document the different input syntax extensions, in a single, very
  simple, document.

- All the astext() functions from the extractors should be replaced by a
  function that checks for the presence of lists, and concatenates if the arg
  is a list. See book() for example.

- Skip symbolic links by default.

- Fix Nabu so that if there is an error, the client script keeps going and
  tries to publish the other documents.

- Can I use views to hide the disclosure requirement in my queries entirely?
  e.g.::

    CREATE VIEW documents_public AS SELECT * FROM document;
    CREATE VIEW documents_shared AS SELECT * FROM document WHERE disclosure > 1;
    CREATE VIEW documents_private AS SELECT * FROM document WHERE disclosure > 2;

  And the same for all the remaining tables? All the queries would then have
  to append the correct names from the tables they fetch from.

- Complete the movie extractor to actually store the list of movies, and to
  render the link using the catalog if there is a catalog number.

- We need to provide a way for the extractors to reliably report errors, some
  sort of function.

- You need to define a syntax for multiple links sharing the same description
  and keywords.

- When an error occurs during upload, we should log the trace of the exception
  to the logging module, in order to be able to debug problems easily. (see
  gijsbert's email).

- We need to process the transforms within a single transaction, and to roll
  it back if there is an error, including rolling back the original deletions
  from all the tables. The point is that we want to maintain database
  consistency even if there is an error while processing the document.
  This is very important!

- Make the source storage objects use an acquire/release pattern compatible
  with antipool, so that we could create them once in a mod_python handler
  rather than every time.

- When we change a document's unique Id, the old document remains in the
  database… is there anything we can do about that?

- We need to detect encoding errors better; I have too many documents whose
  encoding does not get detected/guessed automatically.

- Write an extractor for all my google maps links and create a list of them,
  so I can find locations easily. MapExtractor

- Write an extractor for dictionary definitions, in this format:

    defn: unwonted
      Definition of unwontedness.

- Find how to suppress output of errors in HTML conversion, to avoid catching
  this output in the extractors and having this output pollute the contents of
  the extractors. e.g. photography.txt and contacts.

- Nabu ``-l`` is broken.

- Test and check with docutils 0.4

- Revise and fix ``client.py --dump``

- Add a "--config=" option to client rather than force setting NABURC.

- Make the html rendering not render output errors.

- support encrypted files from the publisher

  - problem: how do we identify documents to be published without decrypting?
  - emacs config must keep plain text at the top of the file when decrypting
    in a variable and add it back in when reencrypting

- shares some of the advantages of XML, structured hierarchical tree of nodes

Server Setup
============

- implement asynchronous document processing on the server

- check database locking issues in more detail?

Directives
==========

- Do something about the locator directive

Extractors
==========

- warn on files that have no titles (add minimum requirements extractor)

- there should be an ``.. end`` directive to stop the processing from that
  point on.

  - ..
    disclosure:: private directive could achieve something similar

- add a generic field list parser for creating new types of records without
  code: the entry type is the first field list, if it matches some pattern
  (e.g. if it's empty);

- url extractor: when right after a title, a list of URLs should use the title
  as the description string

Definitions
-----------

Support this:

  :paramour • noun archaic or derogatory a lover, especially the illicit
  partner of a married person. — ORIGIN from Old French par amour ‘by love’.

  :“la monstre tune” expr: beaucoup de fric.

- the first is a word definition
- the second, with either “, ", or ” is an expression

Questions
---------

Support the following syntax for questions:

  :Q: dhs dhs dhshd shdsd hsds

  as Emperor (Tsar) of Bulgaria from 1331 to 1371, during the Second Bulgarian
  Empire. The date of his birth is unknown. He died on February 17, 1371. The
  long reign of Ivan Alexander is considered a transitional period in
  Bulgarian medieval history. Ivan Alexander began his rule by dealing with
  internal problems and external threats from Bulgaria's neighbours, the
  Byzantine Empire and Serbia, as well as leading his empire into a period of
  economic recovery and cultural and religious renaissance. However, the
  emperor was later unable to cope with the mounting incursions of Ottoman
  forces, Hungarian invasions from the northwest and the Black Death. In an
  ill-fated attempt to combat these problems, he divided the country

  ...that competitions for the design of José Martí Memorial (pictured) in
  Havana, Cuba started in 1939, but the design that was finally constructed in
  1953 was a variation on a design that had come in third in the fourth
  competition?

Jeff's Suggestions
==================

Some things to be taken care of in here:

> Hi Martin! I've been playing with Nabu for the past couple of days, writing
> my own extractors.
> I'm looking at using it for the Python Advocacy effort,
> where different people can write documents on certain topics and Nabu links
> them together in a way more structured than a wiki. You can see a preliminary
> page at advocacy.dfwpython.org, where the various drop-down links will go to
> documents processed by Nabu.

Oopsy I get an error.

  Bad Gateway
  The proxy server received an invalid response from an upstream server.

> - In the extractor sources, I would add a comment warning people that only
>   the first instance of a field in a block gets taken.

You mean "in a field list". That depends on the extractor. The extractors are
still changing a fair bit, and I'm not sure I consider them to be a part of
Nabu just yet. The ones I provide are the ones I use, but they really are just
examples. I'm thinking of separating them at some point. Not sure it matters,
there are so few users yet anyway.

> - I wish there were a way for a date created/modified. I can store _a date_
>   in the store() method, but there is no distinction between create and modify
>   at this level since Nabu entirely removes a record before adding/modifying
>   it. Admittedly this simplifies things, omitting the need for SQL MODIFY
>   directive, but it loses the create/modify distinction. Perhaps Nabu could
>   stash the record from the __sources__ table and pass it to store(), so that
>   store could know if this was the first or Nth time this document has been
>   seen and have access to the date column in __sources__.

Even that would not be sufficient: things extracted from the documents
themselves can be added, deleted or modified. Unless there is a way to
uniquely identify those "things" (whatever they are, bookmark, contact info,
etc.), there is no way to figure that out. IMO the problem is ill-defined.

However I think it would be possible to implement this for the documents
themselves, as a special exception. I'm not sure I need it though.
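A minimal sketch of that special exception, for documents only. It assumes a
hypothetical ``doc_dates`` table keyed by unid (this is not Nabu's schema, and
``store_document`` and ``make_db`` are made-up names): the created date is read
before the record is wiped, then carried over to the freshly inserted row.

```python
import sqlite3


def make_db():
    """Create an in-memory database with the hypothetical doc_dates table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_dates ("
        " unid TEXT PRIMARY KEY, title TEXT, created TEXT, modified TEXT)")
    return conn


def store_document(conn, unid, title, today):
    """Delete-then-insert a document while keeping its first-seen date."""
    row = conn.execute(
        "SELECT created FROM doc_dates WHERE unid = ?", (unid,)).fetchone()
    # First time this unid is seen: created and modified are the same day.
    created = row[0] if row else today
    # The connection context manager commits on success and rolls back
    # if either statement raises.
    with conn:
        conn.execute("DELETE FROM doc_dates WHERE unid = ?", (unid,))
        conn.execute(
            "INSERT INTO doc_dates (unid, title, created, modified)"
            " VALUES (?, ?, ?, ?)", (unid, title, created, today))
    return created
```

This only tracks the document record itself; it says nothing about the
individual entries extracted from the document, which is the ill-defined part.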
> I know how to do this now, but it wasn't spelled out in the docs.

thanks! :-)

> - It would be useful for the extractors to have a method invoked at the
>   start and end of processing a series of documents, for the purpose of
>   generating summary views across a set of documents that might have changed.

Great idea!

> - It was unclear how the order in which extractors appear in the transforms
>   list relates to the 'default_priority' field within each extractor. They
>   each control different aspects of processing but I don't quite get it.

The default_priority defines the ordering. This is more of a docutils issue
that needs to be documented in the Nabu extractors developers guide.

> I was at one point using multiple tables per extractor, for a many-to-many
> relationship and hit the following:
>
> - In extract.py, class SQLExtractorStorage, the __init__ logic iterates over
>   the sql_tables dictionary, creating them. However if there are dependencies
>   between the tables, such as one table for values and a second table for a
>   many <-> many relationship, the second table may get created first,
>   depending upon the vagaries of dictionary hashes. I changed __sql_tables__
>   to use an 'ordered dictionary' from the Python Cookbook so that the tables
>   were created in the order listed.

Hehe I also hit this problem not too long ago. I changed the list to be a list
of tuples instead, which now also includes the type of the object, e.g.::

    sql_relations_unid = [
        ('document', 'TABLE', '''
            CREATE TABLE document
            (
               unid TEXT PRIMARY KEY,
               title TEXT,
               author TEXT,
               date DATE,
               abstract TEXT,
               location TEXT,
               -- Disclosure is
               --  0: public
               --  1: shared
               --  2: private
               disclosure INT DEFAULT 2,
               -- If this document is a tag index, the tag for which it is.
               tagindex TEXT DEFAULT NULL
            )
        '''),
        ('tags', 'TABLE', '''
            CREATE TABLE tags
            (
               unid TEXT NOT NULL,
               tagname TEXT
            )
        '''),
    ]

    sql_relations = list(locresolv.schemas) + [
        ('tagindex_idx', 'INDEX',
         """CREATE INDEX tagindex_idx ON document (tagindex)""")
    ]

> I also use table/column constraints in my database and had trouble controlling
> the order in which tables were created/dropped, to satisfy those dependencies.
> I use 'CASCADE ON DELETE' on my tables to automate most of it. I used the
> 'ordered dictionary' trick to handle some more of it, and then wrestled with
> the default_priority/transforms[] ordering. However I still had these problems:

(same as above.)

> - I wanted my new tables to have their 'unid' dependent upon the Nabu master
>   table which, I thought at first, was the doctree table (later I saw that
>   it really is __sources__). The problem is that I had to add a FinalDocTree

Yes, __sources__ is the single master resource.

>   and InitialDocTree, to initially INSERT the unid into the doctree table
>   at the start of processing, so that other tables' later external references
>   would work. The FinalDocTree recorded the pickle of the document after all
>   transforms have run.

Personally I just store the final doctree and get rid of the initial doctree.
I don't have a constraint indeed, and I really should do that.

> - When I switched to __sources__ I saw that you don't insert the document's
>   info until the end of processing, breaking referential constraints.
>   However because the various extractors each do their own database commits,
>   this means the database is in an indeterminate state during processing.
>   I'd like to see Nabu insert the info entry at the start of extractor
>   processing and avoid doing any database commits until the end of all
>   extractor processing.

Yes (this is related to the above, I should fix it, great observation).

> - The database security info is in the cgi-bin file connect.py.
>   I had to make sure that directory was execute-only re Apache.

Bah, Apache stuff. Always a PIA that stuff.

> - The docs need a bit more Apache setup info, re cgi-bin, HTTP auth and
>   perhaps even how to use HTTPS/SSL for non-public portions of the
>   interaction.

Sure.

> - For some reason, the cgi-bin program nabu-contents.cgi always displays
>   a 404 Not Found message at _the bottom_ of the content.

Oopsy. I think I fixed that a while ago. Could you fetch the snapshot and
confirm?

> - If a 'user=' is not specified in the nabu.conf file, and one is not passed
>   in on the command-line, nabu connects as user 'None'. Perhaps it ought to
>   print an error message.

Oopsy indeed. Sounds like a bug.

> - When there is an error in the CGI, the traces show the database login info.
>   This may be a problem with the xmlrpclib, but it was disconcerting.

Hmmm. Sounds more like a CGI problem, not sure what I can do about that one.

Note that I did a lot of work on the error reporting since your snapshot (Dec
2005). That used to be one of the weak areas, and sometimes still is when an
extractor crashes. One thing I'd like to do is handle xmlrpclib errors more
gracefully.

> - The docs need an expanded section on the "marker_regexp" and how document
>   ids work. First, the wording is unclear on whether _each_ document needs
>   a unique id or whether a _set_ of documents needs a group id. As it turns
>   out, the answer is both. The docs give an example of a unicode spec
>   preceding the :Id: field, but I didn't realize that you had to adjust your
>   marker_regexp to _match_ that unicode spec. So with the marker_regexp

Hmm do you have to? The client.py is doing re.search(), so you should not have
to change it. My .naburc does not customize the regexp. I just tested with::

    .. -*- coding: utf-8 -*-
    :Id: testttt

and it worked for me.

>   field you can select, via a proper pattern, _which_ documents are examined
>   and then you also set the pattern to extract the particular id you want.
>   Some examples would help in the docs. This also needs to clearly state

Will do.

>   the purpose of the (group) portion, and also what part of the pattern
>   will be removed from the document before transmission. I'm still getting
>   :Id: transmitted but with a blank value and I'm not sure why. I'm sure
>   it's to do with the pattern match.

I'm not sure what you mean with the "group" portion. What is this?

Hi Jeff

I did a little bit of work on Nabu while the laundry was going:

--------------------

1. The nabu command uses a 4-argument call to process_source but the nabu
   server wants 5 arguments. I added reporting_level to the client and it
   works now.

Hmm that's strange, I have this in client.py:223 ::

    errors, messages = server.process_source(
        pfile.unid, pfile.fn, xmlrpclib.Binary(pfile.contents), report_level)

and this in server.py:269 ::

    def process_source(self, unid, filename, contents_bin, report_level):

Do we have the same version?

--------------------

2. In nabu-publish-handler.cgi, there is an invocation of the global method
   'create_server' but it isn't imported or defined in that .cgi file.
   Something got refactored incorrectly, I think.

FIXED

--------------------

1. In the document extractor there is a reference to a 'walus' module, but no
   sign of it in a normal install. I've commented it out. Perhaps a
   conditional import test to drop out its functionality if not present?

The problem here was that I have a library called locresolv in another project
called 'walus' that depends on another project 'antiorm' that I do not want
nabu to depend on. I have moved locresolv to nabu and made it a conditional
import. One remaining issue is that locresolv needs some more work before it
is usable, so for now I have disabled it.
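For reference, a conditional import of this kind can be as small as the sketch
below. ``optional_import`` and ``location_schemas`` are illustrative names, not
Nabu functions; the only thing taken from Nabu is the ``locresolv.schemas``
attribute that appears in the schema listing earlier.

```python
import importlib


def optional_import(name):
    """Return the named module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# None when locresolv is absent; callers must check before using it.
locresolv = optional_import("locresolv")


def location_schemas():
    """Schemas contributed by locresolv, or an empty list without it."""
    if locresolv is None:
        return []
    return list(locresolv.schemas)
```

With this shape the rest of the extractor can build its relation list from
``location_schemas()`` and never needs to know whether the library is there.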
BTW, locresolv is a cool library that allows you to enter location names
loosely, given the assumption that the components of a location name are
distinct. For example, if this location has been seen at least once and is
available in the database::

    Farmer's Market, Beverly Hills, Los Angeles, California, USA

thereafter all of the following location entries will resolve to the same
place::

    Farmer's Market
    Farmer's Market, Beverly Hills
    Farmer's Market, Beverly Hills, USA
    Farmer's Market, California, USA
    Farmer's Market, Los Angeles, USA

Basically, it will resolve the name to its most complete form, so that when
you create a new blog entry you don't have to think about it too much.

--------------------

3. Currently I'm tracking down an inability to do basic things with the
   server - I'm getting an XML-RPC error code of -1, with no traceback or
   error text. This is when it is attempting to 'ping' the server right after
   creating the server proxy.

XML-RPC is notoriously terse when there is an exception on the server
side--almost nothing is brought back to the client. I definitely need to do
something about this.

FIXME

--------------------

2. I chased what I thought was a problem, in that my old extractors stopped
   extracting data under the new snapshot. No errors, they just didn't match
   anything. The cause was a node.clear() call in the document extractor. It
   removed the whole docinfo section, but I had a half dozen extractors
   plucking out various fields in that area. Removed the node.clear() and they
   work again.

I remember fixing a bug with the docinfo but do not remember if that was what
you describe. The node.clear() call in document.py::

    def depart_docinfo(self, node):
        self.in_docinfo = 0

        # Remove the bibliographic fields after processing.
        node.clear()
        raise nodes.StopTraversal

was there because we do not want to leave the docinfo fields visible in the
output rendered document. I think that it would be wiser to do this at
rendering time instead.
So I removed the node.clear() call, and changed the renderer in Walus instead
to not render the docinfo fields.

--------------------

> I thought I'd send along some feedback on the things I learned along the way.
> I was using the Dec 2005 release, not the intermediate SVN versions,
> figuring the formal release is more stable.

Actually, the latest is the most stable. I fixed a lot of bugs since then.
Always use the snapshots. I want to set up a public SVN server but I can't
right now because the SSL on my machine is hogged for my commercial
application.

I spent some time today trying to set up a public Subversion repository for
you to use; it's taking time, I'm having trouble with DAV, trying to fix this
with my sysadmin ppl, I think the firewall rules do not allow it.

I have set up a trac though: http://furius.ca/trac/public

The subversion will be at http://furius.ca/svn/public when it's working (GET
requests are already working, you still can't checkout or ls, I don't know
why)

--------------------

3. Chased another false alarm; I missed changing one of my extractors from
   using sql_tables to sql_relations_unid in the storage class. However Nabu
   raised no alarm as a result of the SQL storage class not having any SQL
   tables. Perhaps a sanity check would be useful, at least as a warning.

Well, there are defaults, and the defaults are empty. You would like to force
the declarations instead?

--------------------

> - I'd like to see some discussion in the docs about using the command-line
>   switches. For example, if I add an extractor on the server, but none of
>   my documents have changed, how do I force a rerun of all my documents?

Yes, I will add that in the docstring for the nabu client. "nabu --help" has
so many options that it isn't very clear how to use it. A more human-readable
document is in order.

Done (some).
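Coming back to the sanity check suggested in item 3 above (the storage class
that silently declared no tables): a warning-only version might look like the
sketch below. ``check_declarations`` is a made-up helper, not part of Nabu; the
attribute names are the ones mentioned in the discussion, and treating a
leftover ``sql_tables`` as a mistake is an assumption based on that report.

```python
import warnings


def check_declarations(storage):
    """Warn about storage objects whose SQL declarations look wrong.

    Returns the list of problems found, so callers can also act on them.
    """
    problems = []
    # The old attribute name is no longer read, so it is most likely a
    # leftover from before the rename.
    if hasattr(storage, "sql_tables"):
        problems.append(
            "declares sql_tables, which is ignored; "
            "rename it to sql_relations_unid")
    # Empty defaults are legal, but usually mean a declaration was forgotten.
    declared = (getattr(storage, "sql_relations_unid", None)
                or getattr(storage, "sql_relations", None))
    if not declared:
        problems.append("declares no SQL relations at all")
    for problem in problems:
        warnings.warn("%s %s" % (type(storage).__name__, problem))
    return problems
```

Keeping it a warning rather than an error leaves the empty defaults usable for
storage classes that really have nothing to create.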
--------------------

> - At first I couldn't get nabu-contents.cgi to work, until I realized that
>   it is tied to the specific set of tables used elsewhere. I had book and
>   link disabled as I wasn't using them, and was puzzled by the errors.

Yes, those scripts are meant to be used as examples. In my current
presentation layer, all that stuff is done within resource handlers with my
Ranvier framework. I don't even use the CGI scripts. They are just there for
demo. I should indicate that more clearly.

I added a big nasty warning in the cgi-bin publisher script.

--------------------

> - In the docs, the section that talks about ~/.naburc or nabu.conf is unclear
>   whether the client, server or both uses it. I now know that only the
>   client uses those files, and that the server uses settings in the cgi-bin
>   files.

Will clarify, thx.

Done.

--------------------

> - All the tutorials are unclear about transforms. Does simply "extracting"
>   information from the tree necessarily alter it? I didn't see much on why
>   and how you'd actually _modify_ the tree. Some expanded material on when
>   and how to actually transform the tree would be cool.

I've written a book extractor that modifies the tree recently. Basically,
every time you want a list of things expressed as field lists the HTML
rendering looks pretty bad (render a list of field lists with the HTML
renderer and you'll see what I mean, it renders horribly; I think this is
because of the way the HTML renderer works in most browsers). My books get
converted into DIVs now, it looks much better.

Note that you may be able to make this change at render time rather than at
extraction time. Both are possible.

I added a section on this in the tutorial.
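To give an idea of what modifying the tree looks like, here is a small
stand-alone sketch. It is not the book extractor: the class name, the
``record`` CSS class and the priority value are all made up for illustration.
It replaces each field list in the body of a document with a container node,
which the docutils HTML writer renders as a DIV.

```python
from docutils import nodes
from docutils.core import publish_doctree
from docutils.transforms import Transform


class FieldListsToContainers(Transform):
    """Replace each field list by a container of 'name: value' paragraphs."""

    default_priority = 900  # arbitrary value chosen for this sketch

    def apply(self):
        # findall() is the newer spelling of traverse(); accept either.
        find = (getattr(self.document, "findall", None)
                or self.document.traverse)
        # Collect the matches first, because the loop edits the tree.
        for field_list in list(find(nodes.field_list)):
            box = nodes.container(classes=["record"])
            for field in field_list.children:
                name, body = field.children[0], field.children[1]
                text = "%s: %s" % (name.astext(), body.astext())
                box += nodes.paragraph(text, text)
            field_list.replace_self(box)


def convert(source):
    """Parse reStructuredText and apply the transform by hand."""
    doctree = publish_doctree(source)
    FieldListsToContainers(doctree).apply()
    return doctree
```

Calling ``convert()`` on a document with a field list in its body returns a
doctree in which that list has become a single container. In Nabu the class
would be listed among the transforms instead of being applied by hand.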