We present differing angles on various problems of information management:
Relevance of Information. Not all information that transits in a user's
computer, or that is created by that user, is relevant. There is a great
value in getting rid of junk.
We believe that at this point, this task can only be performed manually: there
is no artificial intelligence algorithm that can decide what is important
for you. This is a key assumption in the motivation for this system:
ultimately, the user is not just a producer of information, but is also an
editor.
No search algorithm can automatically create value. We recognize that
filtering technologies will play an important role in helping us become more
efficient editors, but we do not believe that they will, by themselves, be
able to create "the valuable" automatically anytime soon;
Disparate Storage and Inaccessible Source Data. Information is stored in
various bits that have an associated meaning to them--let us call them
"chunks" for now. Different kinds of chunks are stored in different places, for
example, all addresses are stored in an address book manager program, all
email in a contacts list program. Contacts lists are stored in PDAs, and
despite the availability of synchronization programs (you're lucky if they even
work), ultimately the data lives in different places and there are multiple
copies of it.
Also, these data stores use different storage methods, often specific to the
data model that they choose, and readable only by the specific software that
created them. This makes it difficult to access the data and to build
independent services on top of it.
Services that allow you to enter some of that data online fall in the same
trap: they store the data on their machines in a format that is not accessible
for you, the user, to get it back. You create value by publishing your data
using their system but you do not have a way to get it back in a form suitable
for reuse!
Data Entry. It is difficult to enter the information, for many reasons:
- every time you need to enter a new type of information, you need to start a
program specific to that information storage. These programs change over
time, and this means that you must learn a plethora of programs, just to
enter the data;
- more importantly, you often want to mix different types of data
together in one logical unit. For example, you might want to open a text
file when you're researching a specific issue, say an illness recently
announced by your doctor, and you would want to store links to URLs of
interest, contact information for local specialists who may be able to help
you, as well as text that you write yourself about the illness, or notes that
you make on your condition.
Personally, whenever I embark on any substantial task, I create a new small
document for it--in the form of a text file-- and jot down notes as I
discover more and more aspects of the problem that I'm working on. I think
many people do the same, or would do the same if they found a use for the
document. For example, you might want to share some of this knowledge.
Right now, there is no easy way to do that;
Publishing is Difficult. Publishing much of the information you accumulate
is still very hard. There are many different systems out there which attempt
to provide a way for you to publish certain types of data, in a specialized
way. These require much customization, and a market for specialized services
(such as Blogger) has emerged for specific uses of the data.
Would it not be great if we could build services on top of all your data,
rather than attempting to solve one specific use of the data?
Also of note: online publishing systems store the data remotely, which makes it
difficult to build client software to edit it efficiently. Making it possible
to upload the data in a digested form once it's authored locally has some
potential.
One central value behind our views, and one that you must keep in mind when
considering the proposition that we're about to make, is that of the importance
of simplicity. There is great value in keeping things as simple as they need
to be, because it allows the most flexible reuse of the information.
Many of the rewards of keeping design and data simple can be observed in the
power of the UNIX tools and operating system, which is built upon simple but
very powerful ideas: sockets and files that consist of generic streams of bytes,
small tools that each perform one task really well, and a simple and generic way
to connect those tools together (see "The Art of UNIX Programming",
E. S. Raymond). This has made possible the creation of complex tools without
having to reinvent the small tools, but rather by improving those small tools in
a generic way that allows more possibilities for connecting them in yet more
different ways. Keeping things generic and as simple as possible is a potent
idea.
This idea of designing systems as simple as they need be is also prevalent in
the practice of software development. Over the past ten years, we have seen
development methodologies converge towards this idea. Extreme
programming, agile methodologies, and the growing adoption of dynamic languages
are a direct expression of the quest for reaching closer and closer to the
essence of the problems we're trying to solve while trying to get rid of
unneeded complications. In many ways, software development is in the business
of creating complexity. We are essentially recognizing that keeping our designs
and data models as simple as possible is the most efficient way of controlling
the growth of this complexity.
The history behind the creation of this project stems from a long-standing need
of its author to maintain personal information in a way that is most useful
and that can be kept independent from specific software, over long periods of
time. The sections below outline some of the problems I have tackled in the
past, and the partial solutions I have come to before creating the Nabu
extraction system. Nabu is meant to replace all these tricks to allow me to
extract, organize and selectively publish some of this information.
I needed to maintain an address book. At the time (circa 1993) no decent
software output a textual format that could be read to convert the data
into other formats. Thus around 1997, I decided to transcribe all my physical
address books into a text file and, following my supervisor's advice at
university at the time, used a paragraph-grep program to query it. This worked great for many
years, except that there was no integration with my email programs. I could
however grep and sed the address book file to generate a text file that could in
turn be imported by various email systems. Over time, the one address book file
grew into many, and new contact information moved gradually into the documents
which provided context for them.
I think at some point I started using the LDAP LDIF format to store the
files, but the naming was too long and annoying for adding entries with a text
editor, so I just created my own simple format, which is a list of entries
like this:
n: New Navarino Bakery & Pastry Shop
p: 514-279-7725
a: 5563, avenue du parc, Montréal, QC H2V 4H2
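A format like this is easy to read back with a few lines of code. The following is a minimal sketch in Python, not the author's actual tooling: the long field names (`name`, `phone`, `address`), the rule that an `n:` line or a blank line starts a new entry, and the function names are all assumptions made for illustration. The source only shows the one-letter keys `n`, `p` and `a`.

```python
from typing import Dict, List

# Hypothetical long names for the one-letter keys shown in the example entry.
FIELD_NAMES = {"n": "name", "p": "phone", "a": "address"}


def parse_entries(text: str) -> List[Dict[str, str]]:
    """Parse 'key: value' lines into a list of entries.

    Assumption: a blank line or a new 'n:' line starts a new entry.
    """
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the entry being built, if any.
            if current:
                entries.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line; skip it
        key = key.strip()
        if key == "n" and current:
            # A new name line also closes the previous entry.
            entries.append(current)
            current = {}
        current[FIELD_NAMES.get(key, key)] = value.strip()
    if current:
        entries.append(current)
    return entries


def grep_entries(entries: List[Dict[str, str]], needle: str) -> List[Dict[str, str]]:
    """Return whole entries in which any field contains needle (case-insensitive)."""
    needle = needle.lower()
    return [e for e in entries if any(needle in v.lower() for v in e.values())]


SAMPLE = """\
n: New Navarino Bakery & Pastry Shop
p: 514-279-7725
a: 5563, avenue du parc, Montréal, QC H2V 4H2
"""

entries = parse_entries(SAMPLE)
matches = grep_entries(entries, "bakery")
```

Returning whole entries rather than single matching lines mimics what a paragraph-grep does on the raw file.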
Another issue is that of maintaining a set of bookmarks. One of the problems is
that every few years a new browser comes out, and I end up moving to it. For
example, I started using the web with Xmosaic, and eventually moved to Netscape.
On Windows I eventually had to use IE, and eventually switched to Konqueror on a
Linux machine, and then Mozilla, which was very heavy, so eventually to Firefox.
Most of these browsers have slightly different bookmark storage formats which
are not conveniently edited within emacs.
A more important problem is that of the organization of bookmarks. Adding all
bookmarks in a linear list makes it nearly impossible to reuse them efficiently
(it is very hard to find the bookmark that you're looking for). Tree structures
help alleviate this problem to some extent, but add another problem: when you
want to quickly add a bookmark (somehow, it always has to be quick), you have to
choose a single most appropriate place to put it, and if you're not very careful
with this you often have a hard time finding your bookmark again.
I found this problem really annoying, so I designed a very simple textual format
for bookmarks, where I would enter a description, url, and a list of keywords.
I wrote Tengis, a program that can read this format and can quickly query the
bookmarks with keywords. Unfortunately, I never quite got used to using my own
software on top of the browser, and always ended up grepping the file within
emacs.
Here is an example excerpt of a bookmarks file:
Babelfish
http://babelfish.altavista.com
search, languages, translation
Amazon
http://www.amazon.com
search, books, music
Abebooks
http://www.abebooks.com/
search, books
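To illustrate how little code such a format needs, here is a small Python sketch of a reader and keyword query. It is not Tengis itself, whose implementation is not shown here. It assumes that every entry is exactly three consecutive non-blank lines (description, URL, comma-separated keywords); `Bookmark`, `parse_bookmarks` and `query` are hypothetical names.

```python
from typing import List, NamedTuple


class Bookmark(NamedTuple):
    description: str
    url: str
    keywords: List[str]


def parse_bookmarks(text: str) -> List[Bookmark]:
    """Parse entries made of three non-blank lines: description, URL, keywords.

    Assumption: blank lines are only separators; a trailing incomplete
    entry is ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bookmarks = []
    for i in range(0, len(lines) - 2, 3):
        description, url, keywords = lines[i:i + 3]
        words = [k.strip() for k in keywords.split(",") if k.strip()]
        bookmarks.append(Bookmark(description, url, words))
    return bookmarks


def query(bookmarks: List[Bookmark], *keywords: str) -> List[Bookmark]:
    """Return the bookmarks that carry every one of the given keywords."""
    wanted = {k.lower() for k in keywords}
    return [b for b in bookmarks if wanted <= {k.lower() for k in b.keywords}]


SAMPLE = """\
Babelfish
http://babelfish.altavista.com
search, languages, translation

Amazon
http://www.amazon.com
search, books, music

Abebooks
http://www.abebooks.com/
search, books
"""

bookmarks = parse_bookmarks(SAMPLE)
book_sites = query(bookmarks, "search", "books")
```

With the sample above, `query(bookmarks, "search", "books")` selects the Amazon and Abebooks entries, since both carry the two keywords.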
Another problem is that various links end up being stored in documents, text
files which I write when I accomplish some specific task. These do not make it
to the global bookmarks file.
For convenience, I wrote a script that could convert this file into a tree
structure and automatically generate bookmarks files for whatever browser I'm
using at the time.
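How that script derived the tree is not described here, so the following Python sketch shows only one plausible approach, under stated assumptions: each keyword becomes a folder, and a bookmark is listed in every folder whose keyword it carries, which sidesteps the "single most appropriate place" problem. The output is a generic nested HTML list, not any particular browser's bookmark format, and `build_tree` and `render_html` are hypothetical names.

```python
import html
from typing import Dict, List, Tuple

# (description, url, keywords), as in the bookmarks file excerpt.
BOOKMARKS: List[Tuple[str, str, List[str]]] = [
    ("Babelfish", "http://babelfish.altavista.com",
     ["search", "languages", "translation"]),
    ("Amazon", "http://www.amazon.com", ["search", "books", "music"]),
    ("Abebooks", "http://www.abebooks.com/", ["search", "books"]),
]


def build_tree(bookmarks) -> Dict[str, List[Tuple[str, str]]]:
    """Group bookmarks into one folder per keyword (an assumed layout)."""
    tree: Dict[str, List[Tuple[str, str]]] = {}
    for description, url, keywords in bookmarks:
        for keyword in keywords:
            tree.setdefault(keyword, []).append((description, url))
    return tree


def render_html(tree: Dict[str, List[Tuple[str, str]]]) -> str:
    """Render the tree as a nested HTML list, with folders sorted by name."""
    parts = ["<ul>"]
    for folder in sorted(tree):
        parts.append(f"  <li>{html.escape(folder)}<ul>")
        for description, url in tree[folder]:
            link = f'<a href="{html.escape(url)}">{html.escape(description)}</a>'
            parts.append(f"    <li>{link}</li>")
        parts.append("  </ul></li>")
    parts.append("</ul>")
    return "\n".join(parts)


tree = build_tree(BOOKMARKS)
page = render_html(tree)
```

Because the tree is regenerated from the text file on every run, the file stays the single source of truth and the generated output can be thrown away at will.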
Whenever I have an idea for a project, something that I find interesting enough,
I document it. I would like to share these documents, but they change quite a
bit over time, and they don't necessarily belong together for the presentation
layer.
There is much information to be acquired when using computers. A good habit
that I have acquired is to start a text file to jot notes whenever I take on a
task that is going to take a few hours. This helps keep my focus organized, and
serves as reference if I have to repeat that task in the future. It is also
very useful to just send those instructions when someone asks me how I
accomplished this task in the past. I also avoid wasting time when I need to
make a new iteration of the same task-- I can review my thoughts at the time,
the decisions I made, etc.
When you are surveying a lot of scientific papers, it is good to take notes on
ideas and to summarize the crux of each paper that you read. This helps
organize your thinking by forcing you to write and express your thoughts. I
always wrote short 5 or 6 paragraph reviews of the papers that I read. These
live in separate files and can sometimes be reused by friends when they ask me
about specific subjects, when I point them to some paper or other.
Also, I like to take down quotes from the books that I read. Whenever I read a
book, I mark down interesting passages, and when I'm done with the reading, I
take 30 minutes to copy these passages into text files. I sometimes like to draw
from this body of quotations to add to my signature in email (although I must
admit that I have not used signatures at all for many years now). In
any case, I sometimes enjoy going back to those review files when I'm having an
idea that relates to a book that I have read.
A key theme behind the problems described above is that the software that you
use to manipulate your personal information or notes files is going to change.
Therefore it is a bad idea to use closed formats such as those produced by MS
Word or similar software, if you want to be able to maintain and use these
documents for a long time.
I very much trust simple text files. They will always be readable, and
interpretable, and they use little storage. In this context, docutils is an
amazing tool because it allows you to extract meaningful structure from them, as
long as you follow minimal conventions. One of the principal motivators behind
this system is to provide the ability to maintain all sorts of personal
information using simple text files. This is a key aspect.
Simply stated, our goal is the following:
To make it possible for users to create relevant content, and to allow building
ways to serve it intelligently by providing semantically rich access to their
data.
We want to make it possible to build services on top of the user's valuable
resource: information. In order to do this, we have to make it possible for any
user to build this meaningful source of his information, to add relevance to it.
We want to:
- make it easy to enter the information in a way that allows an automated
system to extract the meaningful chunks of data and associate them with
pre-defined (and extensible) semantics.
This may involve some form of simple markup (e.g. "create new document",
"insert contact info", "insert bookmark"). Easy means simple. The interface
and data format have to be simple, if not trivial;
- provide a service that will store this extracted information in a way that is
accessible by various publishing services;
- create services that will offer creative views on this data.
You can think of a blog interface, image galleries, a birthday notifier
system, a system to sync your data store with your PDA, to serve your
personal bookmarks as RSS feeds, to publish your travel log, to show your
calendar of events, etc.
These views would create value by providing convenient access and novelty on
top of the user's data source. Each of these views would use as its basis
the parsed data source, stored and accessed in an efficient manner (i.e. in a
database).
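The three goals above (extract, store, view) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of the actual system: the single `chunks` table, the JSON encoding of each chunk, the chunk kinds `"bookmark"` and `"contact"`, the file names, and the function names are all hypothetical, and SQLite stands in for whatever database a real deployment would use.

```python
import json
import sqlite3
from typing import Any, Dict, List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with one generic table for all chunk kinds."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " id INTEGER PRIMARY KEY,"
        " source TEXT NOT NULL,"   # document the chunk was extracted from
        " kind TEXT NOT NULL,"     # e.g. 'bookmark', 'contact' (assumed names)
        " data TEXT NOT NULL)"     # the chunk's fields, encoded as JSON
    )
    return conn


def store_chunk(conn: sqlite3.Connection, source: str, kind: str,
                data: Dict[str, Any]) -> None:
    """Store one extracted chunk together with the document it came from."""
    conn.execute(
        "INSERT INTO chunks (source, kind, data) VALUES (?, ?, ?)",
        (source, kind, json.dumps(data)),
    )
    conn.commit()


def view(conn: sqlite3.Connection, kind: str) -> List[Dict[str, Any]]:
    """A trivial 'view': all chunks of one kind, in insertion order."""
    rows = conn.execute(
        "SELECT data FROM chunks WHERE kind = ? ORDER BY id", (kind,)
    )
    return [json.loads(row[0]) for row in rows]


conn = open_store()
store_chunk(conn, "notes/translation.txt", "bookmark",
            {"description": "Babelfish", "url": "http://babelfish.altavista.com"})
store_chunk(conn, "notes/food.txt", "contact",
            {"name": "New Navarino Bakery & Pastry Shop", "phone": "514-279-7725"})
bookmark_view = view(conn, "bookmark")
```

Keeping the store generic means that a new view, such as an RSS feed of bookmarks, only needs a new query over the same table and no change to the text files themselves.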
Our aim is clearly NOT to:
- create yet another specialized personal information management system. For
example, we do not want to manage email data. We want to enable the creation
of relevant data, for the user to create value by identifying relevance in
his info, and we need to make this easy;
- create a desktop search system. We do not want to deal with unorganized
information on a user's system, but to provide a convenient space where a user
can consciously organize most of his textual data. Using all the unorganized
information is difficult and a mistake, because there is a lot of non-relevant
junk data;
We believe that relevance in information is the result of a certain amount of
conscious effort on the part of the user, and that search technologies have an
inherent limit in the quality of the information that they can provide, in terms
of filtering and organizing the data that passes through a user's system. This
is a key aspect of this document and the scope of what we're trying to achieve.
Search can help in organizing, but cannot organize for you. Better search can
alleviate some of the need for organization, but we recognize that ultimately,
to create high-quality content, a conscious effort has to be made.