Sunday, May 29, 2011

USENET style scoring applied to search and browsing.

With "content creation" becoming a relatively common type of business on the net, search engines are overflowing with rubbish content.  Content creation mills analyze search traffic and look for terms that are frequently searched for but produce few good hits in search engines.  They then pay someone a small fee to create some "content" that can be found by search engines and slap ads on it to make money.

The "content" is bait and as with all bait it is unfit for human consumption.  The point of the exercise is to lure the user to click on the ads.  You can imagine that when someone gets paid a couple of dollars to write an article about, say, ball bearings (to pick a random topic), yet has absolutely zero knowledge about ball bearings, the content that is being created won't be of high quality.

And there are lots of people and companies that make a decent living out of creating rubbish content.

Search engines have value when users are happy, but since search engine companies also sell ads they have a balancing act to perform:  the content produced by content mills produces ad revenue, but at the cost of the search experience.  Since content mills have become so numerous and "fake" content is so prolific it is becoming a real problem.

This leads to the question of whether filtering is worth doing.  What if we did the opposite and were able to cherry-pick content instead?  Here are some thoughts:

  • An author (blogger, journalist, writer) has a public key pair.  This is used to sign content and thereby affirm authorship of the content.  This would provide the writer with a verifiable identity and a mechanism for associating content with its author.
  • A public key pair may or may not be used to identify a real person.  It can just as easily be used to affirm that content is produced by the same anonymous source.  The problem we want to solve is not to verify the real identity of the author.
  • A given piece of content can have multiple signatures attached to it.
  • When you read a blog posting or an article, you have three choices.  You can mark it as good, neutral or bad.  The neutral tag means that you'd like to start tracking this author, but you are not making any value judgement at this time.
  • Your choices can be stored locally, with a third-party service, or with the search engine.  Each of these mechanisms has its pros and cons.
  • When you perform searches the search engine can include the verified key id in the search results. This can then be used either by your browser to do local scoring and ranking within the browser, or the search engine can be privy to your preferences and do the scoring and ranking for you.  Known bad sources will be omitted, known good sources might get boosted.
  • Since it will take you forever to build a significant personal database of "known good" and "known bad" authors, it would be obvious to look at collaborative systems for author ratings. For instance you might want to share ratings with people you trust.
The main objective is to provide you with a simple, non-intrusive mechanism for getting some hints about content quality.  It also opens up a wide range of possibilities for collaborative filtering.
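
The scoring step in the list above can be sketched in a few lines. This is only an illustration under assumptions: the Result type, the rerank function, the boost factor and the key ids are all made up for the example, and a result with one bad signature is dropped even if it also carries a good one. Signature verification itself is assumed to have happened already.

```python
from dataclasses import dataclass, field

# Hypothetical rating values; the post only names good, neutral and bad.
GOOD, NEUTRAL, BAD = 1, 0, -1


@dataclass
class Result:
    url: str
    engine_score: float                          # relevance reported by the search engine
    key_ids: list = field(default_factory=list)  # verified signer key ids, possibly empty


def rerank(results, ratings, boost=2.0):
    """Drop results signed by a known bad author and boost known good ones."""
    ranked = []
    for r in results:
        votes = [ratings.get(k, NEUTRAL) for k in r.key_ids]
        if BAD in votes:
            continue  # known bad sources are omitted
        score = r.engine_score * (boost if GOOD in votes else 1.0)
        ranked.append((score, r))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in ranked]


# Assumed personal ratings, keyed by key id.
ratings = {"key-alice": GOOD, "key-mill": BAD}
results = [
    Result("http://example.com/mill", 0.9, ["key-mill"]),
    Result("http://example.com/unsigned", 0.6),
    Result("http://example.com/alice", 0.4, ["key-alice"]),
]
ranked_urls = [r.url for r in rerank(results, ratings)]
```

The same function could run in the browser against local ratings or at the search engine, which is exactly the storage trade-off mentioned in the list.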

I haven't thought this through in much detail, but it is an interesting idea.  Aspects of it are not unlike scoring USENET postings in news readers that support scoring and filtering -- a feature that is sorely missed on the web and especially when performing searches.

As with all half-baked ideas, there is a likelihood that I have overlooked some fundamental problem or problems.

One thing that is immediately obvious is that the mechanism for signing content, and for identifying what content on a page is signed and what isn't, has to be made simple and robust.  This is a challenge because content is often published in a dynamic web page -- where the contents of the page change with each request.

There is also the risk of using signed content as bait.  Say a page contains a signed article from a "known good" author (as per your preferences) in 2pt fonts near the footer and the rest of the page is pure rubbish.  What then?  It might be possible to have some browser tools that will look at the surface area of rendered content on the page and warn the user if it looks as if this baiting is happening (perhaps giving the user the option of adding the page to a personal or shared blacklist?).  I don't know, but this presents an interesting challenge.
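
A rough sketch of the surface area check, assuming the browser can hand us each rendered block as a (width, height, key id) tuple with None for unsigned content. The function names and the 30% threshold are invented for illustration and would need tuning against real pages.

```python
def trusted_area_share(blocks, good_keys):
    """Fraction of the rendered area taken up by content signed by a trusted key."""
    total = sum(w * h for w, h, _ in blocks)
    if total == 0:
        return 0.0
    trusted = sum(w * h for w, h, key in blocks if key in good_keys)
    return trusted / total


def looks_like_bait(blocks, good_keys, min_share=0.3):
    """Warn when trusted content is present but only a sliver of the page."""
    share = trusted_area_share(blocks, good_keys)
    return 0 < share < min_share


good_keys = {"key-alice"}
# A tiny signed article near the footer of a large unsigned page.
bait_page = [(800, 2000, None), (200, 20, "key-alice")]
# A page that is mostly the signed article.
honest_page = [(800, 1500, "key-alice"), (800, 200, None)]
```

A page with no trusted content at all is not flagged here; it is simply an unrated page and falls back to the normal scoring.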

It must also be possible to retrofit this to existing technologies with minimal disruption and complexity.  If you have ever had to write code that signs XML you know that this is horribly complicated and brittle.  Perhaps simpler mechanisms can be arrived at.

In any case, I would love to see someone actually prototype some of these ideas to see if they can be made to work or if they are completely unfeasible.  I like the idea of having a mechanism for recording my personal preferences and influencing my personal search results from these preferences.  I would also love to have a compact UI element that shows me a summary of what I think of the author(s) of a given page (sort of like the SSL indicator in the address bar).

If you are a student at a university looking for a research topic, this might be something for you.

Friday, September 4, 2009

Documents and directories as search expressions

Searching is tedious, and searching for stuff that is not at the top of your mind is even more tedious. However, when you have read something and found it interesting, it is sometimes worthwhile to follow up on that issue over time. The trouble is that there is never time to do that.

What if instead you could mark some document, paragraph or whatever and say "This is interesting, when you find something related to what's written here, tell me but only if you think I'd be really interested". Wouldn't that be nice?

How could that be done? Well, of course it has already been done in a fashion, so I'll just point out that Henry Lieberman at MIT made something called Letizia that uses the browse history to offer suggestions to the user. Alexandros Moukas made something called Amalthaea that does much of the same, but with the interesting difference that he uses a herd of intelligent agents that compete for the user's attention when they are trying to find interesting stuff.

One way of implementing the "document as search expression" thingy would be to apply an Amalthaea-like mechanism to it. It would look for words and concepts listed in the document and then evolve over time to offer relevant advice.
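
As a much simpler stand-in for the evolving, agent-based approach, the matching step alone can be sketched with plain term overlap. Everything here is an assumption for illustration: the function names, the crude tokenizer and the 0.5 threshold. It does no learning over time; it only answers "is this candidate close enough to the marked text to be worth mentioning".

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case word counts; a crude stand-in for 'words and concepts'."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def worth_mentioning(marked_text, candidate_text, threshold=0.5):
    """Only tell the user about candidates that closely match the marked text."""
    return cosine(term_vector(marked_text), term_vector(candidate_text)) >= threshold


marked = "ball bearings reduce friction in rotating machinery"
related = "friction in ball bearings and rotating shafts"
unrelated = "a recipe for tomato soup"
```

With these inputs the related text shares five of seven terms with the marked one and passes the threshold, while the unrelated text shares none.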


Sunday, May 10, 2009

Solving bookmarking (by reinventing the problem)

Bookmarking has never really worked well in browsers. For me it is down to two issues:
  • Bookmarking systems do not address my fundamental problem of finding pages I am looking for.
  • They require some form of editing or organizing.
Breadcrumbs.

A few years ago I spent a couple of weekends thinking about what I wanted in a bookmarking system and I sketched out a system I called "breadcrumbs". This idea was based on the observation that very often I would like to recover a web page that I have seen before. The way I often go about doing this (manually) is by association: I try to remember what I was doing when I saw a particular page. What led up to me seeing this page, what I was doing, what I was thinking etc.

For instance I might remember that I saw a particular article on Reddit a few days ago. So I'll go back to Reddit and try to perform a search. Say I was reading an article about "ruby" that I found particularly interesting. The problem is that if you go back to Reddit and perform a search for "ruby", you will get a lot of hits and since Reddit has no way of understanding your intent ("find this article I have already seen"), it will take some amount of luck for the right article to be ranked up.

I could look at my browsing history and try to search there, but if your browsing is anything like mine, not only will your browsing history be rather bulky, but it will be distributed across a number of different machines.

Services like Delicious are not appropriate for this. They are more geared towards links you know you would like to save and share and thus require more of an effort (editing, tagging, saving).

While the social aspect is certainly interesting, it isn't really a primary feature of a system for finding your way back through your browsing history. Yes, you can probably weave in social features here, but it would be a bit premature before the fundamental problem of recovering context has been solved well.

And of course, it needs to be fast and non-intrusive to your normal workflow.


So how would it work?

In order to have the data you need, my assumption was that you would need to be able to collect not only data on all the pages you view through your browser, but also timing information, subtle feedback from the user and context information.

Browsing through proxy.

The easiest first step to accomplish this would be to make a simple proxy server through which you browse the web. All your browsing happens through this proxy server. This way you can record a log, analyze all content you see and index it. (Don't worry about the architecture of the system at this point. There are plenty of design alternatives here. Just think of this as a starting point).

Instrument the browser.

In addition you would need to instrument the browser so that you can record how you interact with the pages. You want to discover if the user is actually reading the page, scanning the page, skipping the page, if the page is open but obscured, what other pages are open in different tabs/windows, if the user is going back and forth between different views etc. You probably want to classify pages into various categories. For instance I spend a lot of time in GMail, but GMail is more of an application than a page so my interaction with it should be interpreted differently. On the other hand, New York Times is more of a content source than an interactive application, so when I spend a lot of time there, it means something else.

Minimal interaction.

Finally, I need some way in the browser of signalling my interest in a page or adding information about it. My original theory was that a simple three-value button might do:

"+ o -",

meaning "+" = "remember as good", "o" = "remember" and "-" = "remember as bad" -- the latter meaning "I won't be needing to see this page again" or "remember as crap". This would replace the traditional bookmarking in browsers. Additionally the user might want to have a simple (quick!) UI for attaching notes and possibly add tags.

On tags.

Now a lot of people like tags, and I too think they are a good idea. They dispense with the erroneous idea that people want to, or can, deal with rigid classification systems. However, systems that do offer tagging usually aren't very clever about it. If you are making a tagging system then please try to design a more intelligent user interface. If memory serves, I think Adobe Lightroom had some decent features here. It can suggest tags based on context, thus making it easier to add relevant tags (and more tags) based on what can be inferred from the context and associations between tags.

But tags represent just additional information in this system and are not important to the core idea.

Mimicking memory.

The core idea is that once you have gathered all this data, you can begin to mimic the way your memory works by constructing contexts and associative links between pages and clusters of pages. The associations would be distilled from your behavior when looking at the pages as well as the page content. What order you read them in, what links you followed to get there, what content commonalities the pages have etc. By analyzing content and temporal characteristics you would be able to detect "topics" or "threads of activity".

Recovering a page that you know you have seen would then be a process of searching for these associations. For instance you could start with a search which can then be refined interactively as you indicate what ranking factors you would like to boost. Or you could start with a page and get a visualization of what other pages you are associating with this page. These associations can be temporal, they can be based on navigational patterns, thematic overlap, text similarity etc.
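
The interactive refinement could be as simple as a weighted sum where the user turns individual ranking factors up or down. This is a sketch only: the factor names (temporal, navigation, text), the scores and the rank_associations function are assumptions made for the example, not something the original system had.

```python
def rank_associations(candidates, weights):
    """Order candidate pages by a weighted sum of their association scores.

    candidates maps url -> {factor name: score}; factors without an
    explicit weight count with weight 1.0.
    """
    def combined(item):
        _, factors = item
        return sum(weights.get(name, 1.0) * value for name, value in factors.items())

    return [url for url, _ in sorted(candidates.items(), key=combined, reverse=True)]


# Made-up association scores for two pages related to the current one.
candidates = {
    "http://example.com/ruby-article": {"temporal": 0.9, "navigation": 0.2, "text": 0.3},
    "http://example.com/ruby-docs": {"temporal": 0.1, "navigation": 0.1, "text": 0.9},
}
by_text = rank_associations(candidates, {"text": 3.0})
by_time = rank_associations(candidates, {"temporal": 3.0})
```

Boosting "text" brings the docs page to the top, while boosting "temporal" brings back the article that was read around the same time.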

I think it is probably a fair assumption that the process would be iterative. It should allow you to rapidly refine your search criteria to prune the solution space in a matter of seconds.

It is important to reiterate that this is not an objective process -- it is about you, your memory and your associations. Thus it is not a "social" activity and I think bringing social aspects into this system prematurely only confuses the issue.

Other work.


I can remember a friend of mine, Bjørn Remseth, doing some work in this area at about the same time I was playing with the "breadcrumbs" idea. I remember he outlined some ideas on how to make your work environment more context aware, for instance by having a side-bar on your screen that attempts to figure out your current context and discreetly shows information relevant to it: links to pages that are related to what you are doing, a brief list of programs you might want to launch next (à la Sapiens) etc.

How far I got.

I did implement a proxy server that would log my browsing, store the web pages I was looking at and which provided me with a simple framework for analyzing web pages both in real time and in batch. What I did in real time was to try to identify "threads of activity" by looking at web pages that were consumed together (in time) and try to identify separate activities, both by analyzing temporal characteristics and content. I saved these as weighted (and annotated), directed edges in a graph structure, so for a given page I could recall all edges coming from it or going into it. (Note that an edge was usually equivalent to following a link, but it could also occur if I typed in a URL and the system deemed this page to be part of an ongoing "activity").
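
I no longer have the code, so the following is a reconstruction of the general shape rather than the original implementation. It assumes page views arrive as (timestamp in seconds, url) pairs, uses only the time gap between views to split activities (the real thing also looked at content), and the names split_into_activities, EdgeGraph and the 300 second gap are invented for the example.

```python
from collections import defaultdict


def split_into_activities(views, max_gap=300):
    """Group (timestamp, url) page views into runs separated by long pauses."""
    activities = []
    for ts, url in sorted(views):
        if activities and ts - activities[-1][-1][0] <= max_gap:
            activities[-1].append((ts, url))
        else:
            activities.append([(ts, url)])
    return activities


class EdgeGraph:
    """Weighted, annotated, directed edges that can be looked up from either end."""

    def __init__(self):
        self.out_edges = defaultdict(dict)  # src -> {dst: (weight, note)}
        self.in_edges = defaultdict(dict)   # dst -> {src: (weight, note)}

    def add_edge(self, src, dst, weight=1.0, note=""):
        previous = self.out_edges[src].get(dst, (0.0, ""))[0]
        edge = (previous + weight, note)
        self.out_edges[src][dst] = edge
        self.in_edges[dst][src] = edge

    def edges_for(self, url):
        """Return (outgoing, incoming) edges for a page."""
        return dict(self.out_edges.get(url, {})), dict(self.in_edges.get(url, {}))


views = [(0, "a"), (60, "b"), (120, "c"), (4000, "d")]
activities = split_into_activities(views)

graph = EdgeGraph()
for activity in activities:
    # Link consecutive pages within the same activity.
    for (_, src), (_, dst) in zip(activity, activity[1:]):
        graph.add_edge(src, dst, note="same activity")

outgoing, incoming = graph.edges_for("b")
```

Keeping both an outgoing and an incoming index is what makes "recall all edges coming from it or going into it" a cheap lookup.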

I kept the full contents of web pages for N days and then, by way of a periodic pruning process, reduced the pages to tables of occurrence counts as pages got older -- meaning I used traditional TF-IDF to filter terms and for each page ended up with a much more compact form that represented the important content on the page. (Actually, upon crawling/downloading I generated the occurrence tables and saved them, so physically I was just removing web pages from the database. I kept the documents for N days just to have some leeway in adjusting thresholds for what to filter from the occurrence tables).
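
The pruning step can be illustrated roughly like this. The document frequencies below are made-up numbers standing in for the frequency dictionary, the smoothed IDF formula is one common variant rather than necessarily the one I used, and keeping a fixed number of terms is a simplification of the adjustable thresholds.

```python
import math
import re
from collections import Counter


def occurrence_table(text):
    """Raw term counts for one page."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def prune(table, doc_freq, corpus_size, keep=3):
    """Keep only the `keep` terms with the highest TF-IDF weight."""
    def weight(term):
        idf = math.log(corpus_size / (1 + doc_freq.get(term, 0)))
        return table[term] * idf

    top = sorted(table, key=weight, reverse=True)[:keep]
    return {term: table[term] for term in top}


# Hypothetical document frequencies out of a 1000-document corpus.
doc_freq = {"the": 990, "and": 950, "in": 900, "ball": 40,
            "bearings": 5, "steel": 30, "hub": 20}
page = "the ball bearings and the steel bearings in the hub"
compact = prune(occurrence_table(page), doc_freq, corpus_size=1000)
```

Common words like "the" have a high count but an IDF close to zero, so they drop out and the compact table keeps the terms that actually characterize the page.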

(The frequency dictionaries were generated from a large web corpus in 2002 or so. I can't remember if I had dictionaries for each language or even how I solved the problem of languages. A proper solution would need per-language dictionaries and require a lot more emphasis on stemming and lemmatization).

To further augment these occurrence tables I mixed in "synthetic" documents or pseudo-documents. Documents that are the union of several documents and which have synonyms, hyponyms etc added to them to make it easier to find pages by topic rather than exact search terms. This proved an interesting idea, but I didn't have the chance to spend enough time refining the idea and figuring out how you could leverage these synthetic documents. Limited testing suggested that they were useful for quickly navigating to a certain cluster of pages and that they were useful for the realtime activity identification.
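
In sketch form, building such a pseudo-document might look like the following. The related_terms mapping is a hand-written stand-in for whatever thesaurus supplies the synonyms and hyponyms, and the function name and the flat weight given to added terms are assumptions for the example.

```python
from collections import Counter


def synthetic_document(tables, related_terms, related_weight=1):
    """Union several occurrence tables and mix in related terms.

    Related terms are looked up only for terms present in the original
    pages, so additions do not cascade.
    """
    merged = Counter()
    for table in tables:
        merged.update(table)
    for term in list(merged):
        for extra in related_terms.get(term, ()):
            merged[extra] += related_weight
    return merged


tables = [
    Counter({"bearing": 2, "steel": 1}),
    Counter({"bearing": 1, "lubricant": 1}),
]
# Hypothetical thesaurus entries.
related_terms = {
    "bearing": ["ball bearing", "roller bearing"],
    "lubricant": ["grease"],
}
pseudo = synthetic_document(tables, related_terms)
```

A search for "grease" would now hit this pseudo-document even though neither source page contains the word, and the pseudo-document can then point back to the cluster of pages it was built from.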

I never got around to finishing the browser integration. It was while fiddling with this that more pressing matters required my attention, and I forgot about the whole (spare time) project.