Sunday, May 10, 2009

Solving bookmarking (by reinventing the problem)

Bookmarking has never really worked well in browsers. For me it is down to two issues:
  • Bookmarking systems do not address my fundamental problem of finding pages I am looking for.
  • They require some form of editing or organizing.
Breadcrumbs.

A few years ago I spent a couple of weekends thinking about what I wanted in a bookmarking system and I sketched out a system I called "breadcrumbs". This idea was based on the observation that very often I would like to recover a web page that I have seen before. The way I often go about doing this (manually) is by association: I try to remember what I was doing when I saw a particular page. What led up to me seeing this page? What was I doing, what was I thinking, and so on.

For instance I might remember that I saw a particular article on Reddit a few days ago. So I'll go back to Reddit and try to perform a search. Say I was reading an article about "ruby" that I found particularly interesting. The problem is that if I go back to Reddit and perform a search for "ruby", I will get a lot of hits, and since Reddit has no way of understanding my intent ("find this article I have already seen"), it will take some amount of luck for the right article to be ranked near the top.

I could look at my browsing history and try to search there, but if your browsing is anything like mine, not only will your browsing history be rather bulky, but it will be distributed across a number of different machines.

Services like Delicious are not appropriate for this. They are more geared towards links you know you would like to save and share and thus require more of an effort (editing, tagging, saving).

While the social aspect is certainly interesting, it isn't really a primary feature of a system for finding your way back through your browsing history. Yes, you can probably weave in social features here, but it would be a bit premature before the fundamental problem of recovering context has been solved well.

And of course, it needs to be fast and non-intrusive to your normal workflow.


So how would it work?

In order to have the data you need, my assumption was that you would need to collect not only data on all the pages you view through your browser, but also timing information, subtle feedback from the user and context information.

Browsing through proxy.

The easiest first step to accomplish this would be to make a simple proxy server through which you browse the web. All your browsing happens through this proxy server. This way you can record a log, analyze all content you see and index it. (Don't worry about the architecture of the system at this point. There are plenty of design alternatives here. Just think of this as a starting point).
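As a rough illustration of what such a proxy might record, here is a minimal sketch in Python. It leaves out the actual HTTP forwarding and only shows the log itself. The names `Visit`, `BrowsingLog` and `record` are made up for this example, and storing page bodies keyed by a SHA-256 hash is an assumption, not a description of the original prototype.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Visit:
    """One page view as seen by the (hypothetical) logging proxy."""
    url: str
    timestamp: float
    referrer: Optional[str]
    content_hash: str


@dataclass
class BrowsingLog:
    """Append-only visit log plus a store of page bodies keyed by content hash."""
    visits: list = field(default_factory=list)
    pages: dict = field(default_factory=dict)

    def record(self, url, body, referrer=None, timestamp=None):
        """Log one response that passed through the proxy and keep its body."""
        digest = hashlib.sha256(body).hexdigest()
        # Identical bodies are stored once, however many times they are visited.
        self.pages.setdefault(digest, body)
        when = time.time() if timestamp is None else timestamp
        visit = Visit(url, when, referrer, digest)
        self.visits.append(visit)
        return visit
```

A proxy handler would call `record` once per response. The stored bodies can then be indexed or analyzed later, and the referrer gives a first hint of how one page led to the next.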

Instrument the browser.

In addition you would need to instrument the browser so that you can record how you interact with the pages. You want to discover if the user is actually reading the page, scanning the page, skipping the page, if the page is open but obscured, what other pages are open in different tabs/windows, if the user is going back and forth between different views etc. You probably want to classify pages into various categories. For instance I spend a lot of time in GMail, but GMail is more of an application than a page so my interaction with it should be interpreted differently. On the other hand, New York Times is more of a content source than an interactive application, so when I spend a lot of time there, it means something else.
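To make the reading/scanning/skipping distinction a bit more concrete, here is a small sketch of how such signals might be turned into a label. Everything in it is an assumption for illustration: the three signals, the thresholds and the reading speed are invented, and it only covers content pages, not applications like GMail.

```python
from dataclasses import dataclass


@dataclass
class PageInteraction:
    visible_seconds: float  # time the page was in the foreground and not obscured
    scroll_fraction: float  # furthest scroll position, from 0.0 to 1.0
    word_count: int         # words of main text on the page


def classify_engagement(interaction, words_per_second=4.0):
    """Return 'skipped', 'scanned' or 'read' using made-up thresholds."""
    if interaction.visible_seconds < 3:
        return "skipped"
    # Rough time needed to read the part of the page that was scrolled into view.
    expected = interaction.word_count * interaction.scroll_fraction / words_per_second
    long_enough = expected > 0 and interaction.visible_seconds >= 0.5 * expected
    if long_enough and interaction.scroll_fraction >= 0.5:
        return "read"
    return "scanned"
```

A real system would probably want to learn these thresholds per user and per site category rather than hard-code them.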

Minimal interaction.

Finally, I need some way in the browser of signalling my interest in a page or adding information about it. My original theory was that a simple three-value button might do:

"+ o -",

meaning "+" = "remember as good", "o" = "remember" and "-" = "remember as bad" -- the latter meaning "I won't be needing to see this page again" or "remember as crap". This would replace the traditional bookmarking in browsers. Additionally the user might want to have a simple (quick!) UI for attaching notes and possibly add tags.

On tags.

Now a lot of people like tags, and I too think they are a good idea. They dispense with the erroneous idea that people want to, or can, deal with rigid classification systems. However, systems that do offer tagging usually aren't very clever about it. If you are making a tagging system then please try to design a more intelligent user interface. If memory serves, I think Adobe Lightroom had some decent features here. It can suggest tags based on context, which makes it easier to add relevant tags (and more of them) from what can be inferred from the context and from associations between tags.

But tags represent just additional information in this system and are not important to the core idea.

Mimicking memory.

The core idea is that once you have gathered all this data, you can begin to mimic the way your memory works by constructing contexts and associative links between pages and clusters of pages. The associations would be distilled from your behavior when looking at the pages as well as from the page content: what order you read them in, what links you followed to get there, what content the pages have in common, and so on. By analyzing content and temporal characteristics you would be able to detect "topics" or "threads of activity".

Recovering a page that you know you have seen would then be a process of searching for these associations. For instance you could start with a search which can then be refined interactively as you indicate what ranking factors you would like to boost. Or you could start with a page and get a visualization of what other pages you are associating with this page. These associations can be temporal, they can be based on navigational patterns, thematic overlap, text similarity etc.
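The interactive refinement could be as simple as re-weighting a handful of association signals. The sketch below is only one way to picture it: the signal names ("temporal", "text") and the linear scoring are assumptions, and it presumes each candidate page already has a score between 0 and 1 for every signal.

```python
def rank_candidates(candidates, boosts):
    """Order candidate URLs by a weighted sum of their association signals.

    candidates: {url: {signal_name: score between 0 and 1}}
    boosts:     {signal_name: weight}; signals not listed get weight 1.0
    """
    def score(signals):
        return sum(boosts.get(name, 1.0) * value for name, value in signals.items())

    return sorted(candidates, key=lambda url: score(candidates[url]), reverse=True)


candidates = {
    "http://example.com/seen-last-week": {"temporal": 0.9, "text": 0.1},
    "http://example.com/similar-topic": {"temporal": 0.2, "text": 0.7},
}
by_default = rank_candidates(candidates, {})
text_boosted = rank_candidates(candidates, {"text": 3.0})
```

With no boosts the page that is close in time ranks first. Boosting "text" flips the order, which is the kind of quick pruning step the iterative search would rely on.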

I think it is probably a fair assumption that the process would be iterative. It should allow you to rapidly refine your search criteria to prune the solution space in a matter of seconds.

It is important to reiterate that this is not an objective process -- it is about you, your memory and your associations. Thus it is not a "social" activity and I think bringing social aspects into this system prematurely only confuses the issue.

Other work.


I can remember a friend of mine, Bjørn Remseth, doing some work in this area at about the same time I was playing with the "breadcrumbs" idea. I remember he outlined some ideas on how to make your work environment more context aware, for instance by having a sidebar on your screen that attempts to figure out your current context and discreetly shows information relevant to it: links to pages that are related to what you are doing, a brief list of programs you might want to launch next (à la Sapiens) etc.

How far I got.

I did implement a proxy server that logged my browsing, stored the web pages I was looking at, and provided me with a simple framework for analyzing web pages both in real time and in batch. What I did in real time was to try to identify "threads of activity" by looking at web pages that were consumed together (in time) and separating them into distinct activities, analyzing both temporal characteristics and content. I saved these as weighted (and annotated), directed edges in a graph structure, so for a given page I could recall all edges coming from it or going into it. (Note that an edge was usually equivalent to following a link, but it could also occur if I typed in a URL and the system deemed this page to be part of an ongoing "activity".)
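A rough sketch of what such a graph and a purely temporal activity splitter might look like is given below. It is not the original code: `Edge`, `PageGraph`, `split_into_threads` and the five-minute gap are all invented for illustration, and the splitter ignores content entirely, whereas the prototype used both signals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    weight: float
    note: str  # annotation, for example "link" or "typed URL"


class PageGraph:
    """Directed graph of pages, indexed by both ends of every edge."""

    def __init__(self):
        self._out = defaultdict(list)
        self._in = defaultdict(list)

    def add_edge(self, source, target, weight=1.0, note="link"):
        edge = Edge(source, target, weight, note)
        self._out[source].append(edge)
        self._in[target].append(edge)
        return edge

    def edges_from(self, url):
        return list(self._out.get(url, []))

    def edges_into(self, url):
        return list(self._in.get(url, []))


def split_into_threads(visits, max_gap_seconds=300):
    """Group (timestamp, url) pairs into threads, starting a new thread
    whenever the gap to the previous visit exceeds max_gap_seconds."""
    threads = []
    previous = None
    for timestamp, url in sorted(visits):
        if previous is None or timestamp - previous > max_gap_seconds:
            threads.append([])
        threads[-1].append(url)
        previous = timestamp
    return threads
```

Keeping a separate index for incoming edges costs a little memory but makes "what led me to this page?" as cheap to answer as "where did I go from here?".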

I kept the full contents of web pages for N days and then, by way of a periodic pruning process, reduced the pages to tables of occurrence counts as pages got older -- meaning I used traditional TF-IDF to filter terms and for each page ended up with a much more compact form that represented the important content on the page. (Actually, upon crawling/downloading I generated the occurrence tables and saved them, so physically I was just removing web pages from the database. I kept the documents for N days just to have some leeway in adjusting thresholds for what to filter from the occurrence tables).
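The reduction from a full page to an occurrence table can be sketched roughly as follows. The tokenizer, the `min_tfidf` threshold and the handling of unknown terms are assumptions made for this example; `doc_freq` stands in for the frequency dictionary.

```python
import math
import re
from collections import Counter


def occurrence_table(text, doc_freq, corpus_size, min_tfidf=1.0):
    """Reduce page text to {term: count}, keeping only terms whose TF-IDF
    score reaches min_tfidf.

    doc_freq maps a term to the number of corpus documents that contain it.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    kept = {}
    for term, count in counts.items():
        # Terms missing from the dictionary are treated as very rare (one document).
        idf = math.log(corpus_size / max(1, doc_freq.get(term, 1)))
        if count * idf >= min_tfidf:
            kept[term] = count
    return kept


doc_freq = {"the": 900, "is": 800, "ruby": 10}
table = occurrence_table("The ruby interpreter is the ruby runtime", doc_freq, corpus_size=1000)
```

Here the common words "the" and "is" fall below the threshold, while "ruby" and the two terms the dictionary has never seen are kept with their counts.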

(The frequency dictionaries were generated from a large web corpus in 2002 or so. I can't remember if I had dictionaries for each language or even how I solved the problem of languages. A proper solution would need per-language dictionaries and require a lot more emphasis on stemming and lemmatization).

To further augment these occurrence tables I mixed in "synthetic" documents or pseudo-documents: documents that are the union of several documents and which have synonyms, hyponyms etc. added to them to make it easier to find pages by topic rather than by exact search terms. This proved an interesting idea, but I didn't have the chance to spend enough time refining it and figuring out how you could leverage these synthetic documents. Limited testing suggested that they were useful for quickly navigating to a certain cluster of pages and for the realtime activity identification.
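One way to picture a pseudo-document is as a merged occurrence table with related terms mixed in. The sketch below assumes a small hand-made thesaurus (`related_terms`) and a fixed weight for each added term. Both are invented here, and where the related terms actually came from is not something this post specifies.

```python
from collections import Counter


def make_pseudo_document(tables, related_terms, related_weight=1):
    """Merge several occurrence tables and add related terms.

    tables:        iterable of {term: count}
    related_terms: {term: [synonyms, hyponyms, ...]} (a hypothetical thesaurus)
    """
    merged = Counter()
    for table in tables:
        merged.update(table)  # adds counts term by term
    expanded = Counter(merged)
    # Expand only from the original terms so that added terms do not cascade.
    for term in merged:
        for related in related_terms.get(term, []):
            expanded[related] += related_weight
    return dict(expanded)


pseudo = make_pseudo_document(
    [{"ruby": 2, "gem": 1}, {"ruby": 1, "rails": 3}],
    {"ruby": ["programming"], "rails": ["framework", "ruby"]},
)
```

A search for "framework" would now reach this cluster even though neither source page contains the word.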

I never got around to finishing the browser integration. It was while I was fiddling with this that more pressing matters required my attention, and I forgot about the whole (spare time) project.

4 comments:

  1. It would be interesting to know what the toolbar data from the likes of Google and Microsoft is used for in this area. They probably cannot, due to privacy concerns, target individual users so their data is anonymized. But still. What if I would tell the browser/toolbar to post behavior-data to a cloud service that does something along the lines of what you're describing... I don't think I'd have too many concerns with that provided the service was good enough.

  2. The privacy issue is a prickly one. In my prototyping efforts I wrote a relatively simple HTTP proxy which I could then hook whatever I needed into. The advantage was that I could run all my browsing through the same proxy from different machines.

    I think having more of it built into the browser might be a better option, but for this to work properly across all machines you use, you would probably need to persist the data in one place so all your browsers could share the same backend -- and then the issue of trust pops up again. People like me have their own servers. Joe User doesn't and would need to keep data "in the cloud". Or accept that each machine is "an island". (For some scenarios that might be desirable -- for instance if you want clear separation between work and personal contexts. But it is inflexible).

  3. I wouldn't be very surprised if Chrome comes up with some community browsing features in a forthcoming release for enhancing user experience. The problem however is how to capture the essence of "interesting". Nevertheless, even if there's some easy way of pulling up "related" articles, that would be a great enhancement.

  4. @Anoop a friend of mine did some research work in this area. One of the ideas was to have a non-intrusive "side display" which would produce a ranked list of suggestions for other things the user has not yet seen that might be of interest. I think someone made a prototype for Emacs that does this.
