Building Interoperable Metadata
Hal Abelson, Ben Adida, (Eric Miller?), (Stefano Mazzocchi?)
the outline is below, the paper is in progress here.
Introduction
- as we build more and more content, metadata is more and more important, as it clears the path to the data
- the amount of metadata and the need to express it are exploding
- one example: the world of tagging and folksonomies is creating
a significant amount of simple, user-generated and user-curated
metadata. Classification, browsing, and now searching are
guided by this metadata
- new services like Technorati Tag Search are making use of these
tags across sites, though this means they have to build
special adapters to these sites, given each site's API.
- these various communities need ways of making their local data
accessible to the world in a standard way that doesn't cause
an n-squared interoperability problem. A small amount of work
by each site should let it become a "good metadata
citizen."
- currently, these sites use a functional approach to publishing
their metadata. But metadata is data. A functional approach
assumes one way of viewing this data. Even if everyone
standardized on the same SOAP API (WSDL), it would be
inappropriate: the data-fetching calls have no side effects,
and the data formats would remain undefined (unless
we all adopted the same XML schemas).
- we need a data-centric approach for publishing metadata.
- once we have universally accessible tags, we can start building
synonyms across sites with Metatagging.
- these techniques extend very naturally to relationships
other than tagging. The idea is the same for declaring
authors, licenses, structural content relationships, etc.
Our Actors
- Paul, the web publisher, who manages a photo community site (à la Flickr).
- Patrick, another web publisher, who manages a social bookmarking site (à la Del.icio.us).
- Tim, the tool builder, who wants to consume metadata from many sites and build interesting output.
- Ursula, a user of Paul's web site.
A Metadata Utopia: Wouldn't it be nice if...
- what would an ideal metadata world look like?
- web publishers and web publishing tool builders:
- ability to use natural methods for expressing metadata: XML and HTML
- ability to define own, independent, metadata concepts, able to model anything the author wants.
- ability to reduce data duplication in metadata expression: renderable data should be structurable, too
- ability to reuse other metadata concepts from other publishers when desired.
- making room for unanticipated concepts as first class citizens.
- add metadata on pretty much anything, without having to
define special formats: orthogonality between subject of
assertion and property of assertion.
- does flat derive from existing properties? It might.
- client tool builders
- access metadata assertions on any web site
automatically, without any prior knowledge of metadata
schema
- observe common metadata properties between publishers
when these publishers happened to use the same metadata
concepts, even if the other metadata concepts are
different.
- how current schemes fail to meet this utopia
- straight XML:
- accessing certain metadata fields is not possible
without knowing the schema.
- reuse of components is possible, but this reuse is
hardly useful as the common elements cannot be accessed
generically: knowledge of enveloping schema is required.
- Technorati:
- searches across sites use custom metadata
parsers. n-squared problem.
- unable to determine metadata concept scope: "chat"
and "chat"
- Microformats:
- this is HTML marked up to become XML-like: microformats are to XML as RDF/A is to RDF.
- all the problems of XML: no reuse of attributes without knowledge of the outer schema.
- almost worse: no reuse of attributes within a given site, unless the site maintains n-squared transforms between the microformats it uses.
Folksonomies: A First Target for Interoperable Metadata
- user tagging is the first truly successful metadata publication
mechanism.
- whether used communally (Flickr) or individually (Gmail), tags
map quite well to the natural human method of content
categorization. The immediate feedback approach touted by many
allows strong, meaningful concepts to evolve from a
community of users.
- we're approaching a point where people find it useful to access
data from multiple sources with the same tag:
Technorati Tag Search. As the number of sites that offer
tagging grows, this need will increase dramatically.
- right now, these metasites are using specific, hardwired
interfaces to each site. This is an n-squared problem, and it
won't scale. How do we do better? How can we effectively
browse a tag across all sites that support tags?
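The scaling argument above can be made concrete with a little arithmetic. The following Python sketch assumes the worst case, in which every site wants to read every other site's tags; the function names are illustrative only.

```python
def pairwise_adapters(sites: int) -> int:
    """Adapters needed when every site hardwires an interface to every other site."""
    return sites * (sites - 1)


def shared_format_exporters(sites: int) -> int:
    """Exporters needed when every site publishes once in a shared format."""
    return sites


if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, pairwise_adapters(n), shared_format_exporters(n))
```

With hardwired interfaces the work grows with the square of the number of sites; with a shared format it grows linearly, one exporter per site.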
World of Metadata
Functional or Data?
- more than just humans browsing. Machines, too
- A number of these sites have APIs: Flickr, Del.icio.us,
etc. These APIs use REST, SOAP, etc.
- All of these APIs are functional, not data-oriented: they
model actions and queries. Yet the underlying metadata is just
data! Why should there be an additional layer of functionality
above the typical HTTP GET?
- The returned data is still under some site-specific
schema. Unless every site adopts the same schema, each consumer
still needs site-specific parsing.
- Can there be, instead, a data solution to this problem? Meaning,
a standard way to express tags (metadata) such that a simple
HTTP GET on the resource we're interested in will yield the
proper metadata?
- On top of a data layer, any API can be layered transparently,
without knowing anything more. The reverse is not true.
- Metadata is just data. We need a universal data access mechanism
such that the query: "give me all tags for this resource" is
generic across sites.
- This is very important for small publishers: you don't need
to support anything other than HTTP GET to get your metadata out
there. Think of blogs and, especially, content management
systems that produce static pages.
- So, clearly, we need a data approach to metadata.
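To illustrate the "give me all tags for this resource" query, here is a minimal Python sketch. It assumes each site's metadata has already been fetched with a plain HTTP GET and parsed into (subject, property, object) URI statements. All example.org URIs, the TAGGED_AS property, and the tags_for helper are hypothetical names chosen for illustration, not something any site publishes today.

```python
from typing import Iterable, List, Tuple

# A statement links a subject URI to an object URI under a property URI.
Statement = Tuple[str, str, str]

# Hypothetical shared tagging property (an assumption for this sketch).
TAGGED_AS = "http://example.org/rdf/TaggedAs"


def tags_for(resource: str,
             statements: Iterable[Statement],
             tag_property: str = TAGGED_AS) -> List[str]:
    """Return the sorted tag URIs attached to `resource`, whatever site it came from."""
    return sorted({obj for subj, prop, obj in statements
                   if subj == resource and prop == tag_property})


# Hypothetical data from a photo site and a bookmarking site.
photo_site = [
    ("http://photos.example.org/42", TAGGED_AS, "http://photos.example.org/tag/sunset"),
    ("http://photos.example.org/42", TAGGED_AS, "http://photos.example.org/tag/paris"),
    ("http://photos.example.org/42", "http://example.org/rdf/author", "http://photos.example.org/user/ursula"),
]
bookmark_site = [
    ("http://example.org/article", TAGGED_AS, "http://bookmarks.example.org/tag/rdf"),
]

if __name__ == "__main__":
    print(tags_for("http://photos.example.org/42", photo_site))
    print(tags_for("http://example.org/article", bookmark_site))
```

The same function serves both sites because it only looks at the data; no site-specific call had to be modeled.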
How should it be exported?
- a separate set of data? e.g. separate XML files?
- globally-accessible concepts with RDF? Or locally-defined concepts like XML?
- for metadata, XML is sufficient.
- but for INTEROPERABLE metadata?
World of Interoperable and Extensible Metadata
- There is already an abstract approach to expressing data
relationships. It's called RDF.
- RDF simply provides typed links between world-accessible
resources. A URI is linked to another URI under a property
identified by a URI.
- With a simple RDF property: TaggedAs, we can express
tagging statements in ways that everyone can understand. Note
that this RDF property may well be specific to the site, in
case they have their own definition of tagging. So the
property might be
http://flickr.com/rdf/TaggedAs.
- assume a publisher starts out with XML, using some homegrown
document format. RDF statements can then be expressed in XML
documents using GRDDL.
- assume a publisher starts out with HTML, possibly marked up
with class and rel. RDF statements can also be
expressed very simply using RDF/A.
- microformats: GRDDL!
- One can start by building their own internal representation
of metadata, using stylesheet classes in HTML, or their own XML
schema. One can then simply "upgrade" to a globally-accessible
mechanism by using GRDDL, which can perform any XSLT to
"extract" the RDF metadata statements. More interestingly, one
can build in metadata in any XML dialect using RDF/A.
- The beauty of it is that some sites have already started doing
this without realizing it, via the use of RSS 1.0 feeds. These
feeds are based on RDF, which means that an RSS 1.0 feed of
tagged data is universally parsable!
- What about PiggyBank? PiggyBank's idea is to extract RDF from
existing sites, with transforms built into the client. This is a
good stop-gap solution, but it cannot be the end solution as there
is an issue of provenance: did the publisher really imply the
extracted metadata?
- long term, though, having the same data source for human
browsing and machine browsing is most maintainable: microformats
and RDF/A.
- For interoperability across schemas, there's only
RDF. Microformats and XML schemas inherently prevent
publisher-driven extensibility.
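As a rough illustration of "a URI linked to another URI under a property identified by a URI", the Python sketch below models one tagging statement and writes it out in N-Triples syntax. The photo and tag URIs are hypothetical; the property URI is the illustrative one used in this outline. The sketch assumes the URIs contain no characters that would need escaping.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF statement in which subject, property, and object are all URIs."""
    subject: str
    prop: str
    obj: str


def to_ntriples(statement: Statement) -> str:
    """Serialize a URI-only statement as a single N-Triples line."""
    return f"<{statement.subject}> <{statement.prop}> <{statement.obj}> ."


# Hypothetical photo and tag URIs; the property is the outline's example.
tagging = Statement(
    subject="http://example.org/photos/42",
    prop="http://flickr.com/rdf/TaggedAs",
    obj="http://example.org/tag/sunset",
)

if __name__ == "__main__":
    print(to_ntriples(tagging))
```

Because every part of the statement is a globally resolvable name, a consumer can compare properties across publishers without knowing any enclosing schema.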
Metatagging: Building a World of Cross-Site Synonyms. Point of View?
- Once tags are universally accessible between sites, then there
is a possibility of building a meta-tagsonomy. This
meta-tagsonomy would allow a bottom-up, user-generated set of
equivalences between different tags and sets of metadata
online.
- Built just like the current tagging schemes, with:
- user-generated links
- immediate feedback on what got linked with a random sample of
elements to confirm similarity
- ability to undo a link
- Need for some new stuff:
- statements should be linked to a user to know who said
this
- count how many users agree with this link
- some links are more endorsed than others.
- the metatagging has fewer core links, but more
fine-grained data on each and on how valuable that link is.
- Some techniques already heading this way: Stefano's blog post.
- This eventually generates a "point of view" on the world.
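The requirements listed above (user-generated links, undo, attribution, and endorsement counts) could be sketched as a small in-memory structure. This is only one possible design under simple assumptions: links are symmetric, users are plain strings, and the MetatagStore class and its method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


class MetatagStore:
    """Tracks which users assert that two tag URIs are synonyms."""

    def __init__(self) -> None:
        # Maps an unordered pair of tags to the set of users endorsing the link.
        self._endorsers: Dict[FrozenSet[str], Set[str]] = defaultdict(set)

    @staticmethod
    def _key(tag_a: str, tag_b: str) -> FrozenSet[str]:
        return frozenset((tag_a, tag_b))

    def link(self, user: str, tag_a: str, tag_b: str) -> None:
        """Record that `user` considers the two tags equivalent."""
        self._endorsers[self._key(tag_a, tag_b)].add(user)

    def unlink(self, user: str, tag_a: str, tag_b: str) -> None:
        """Undo a user's link; drop the link entirely once nobody endorses it."""
        key = self._key(tag_a, tag_b)
        self._endorsers[key].discard(user)
        if not self._endorsers[key]:
            del self._endorsers[key]

    def endorsement_count(self, tag_a: str, tag_b: str) -> int:
        """How many users agree with this link."""
        return len(self._endorsers.get(self._key(tag_a, tag_b), ()))

    def who_said(self, tag_a: str, tag_b: str) -> List[str]:
        """Which users made this statement, in sorted order."""
        return sorted(self._endorsers.get(self._key(tag_a, tag_b), ()))


if __name__ == "__main__":
    store = MetatagStore()
    store.link("ursula", "http://a.example.org/tag/nyc", "http://b.example.org/tag/newyork")
    store.link("tim", "http://b.example.org/tag/newyork", "http://a.example.org/tag/nyc")
    print(store.endorsement_count("http://a.example.org/tag/nyc", "http://b.example.org/tag/newyork"))
```

Keeping the set of endorsing users, rather than a bare counter, is what makes both attribution and undo possible.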
Extending the Technique to Other Metadata
- RDF can be used to express metadata of arbitrary
complexity.
- Examples:
- Creative Commons moving from cc:license to
dc:license
- DSpace GRDDL'ization
- Libraries
- Gene Information
Issues
- RDF/A: requires XHTML2
- GRDDL: requires modification of HEAD
Current Sites and What They're Doing
Flickr
Flickr uses a REST API which returns XML datasets.
Del.icio.us
Del.icio.us is a collaborative bookmark manager. Users tag links they
want to bookmark. Links are browsable by user, by tag, by
combination, etc.
Del.icio.us provides a REST API to fetch data by tag, user, etc.
Most interestingly, it also provides an RSS feed in RSS 1.0,
which is RDF. This means that there are in fact RDF statements
being made about bookmarks on Del.icio.us with tags being
assigned RDF resources,
http://del.icio.us/tag/<tag>. One hook is
missing, though: the RDF statements about the tag
are actually made at
http://del.icio.us/rss/<tag>, but that is
probably easy to fix.
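To show what "RSS 1.0 is RDF" buys a consumer, the Python sketch below pulls (bookmark, tag) statements out of a small RSS 1.0 document using only the standard library. The sample feed is hand-written for illustration and is an assumption, not a copy of actual Del.icio.us output; it assumes tags are listed with the RSS 1.0 taxonomy module inside an rdf:Bag.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
TAXO = "http://purl.org/rss/1.0/modules/taxonomy/"

# Hypothetical feed, written by hand for this sketch.
SAMPLE_FEED = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/">
  <item rdf:about="http://example.org/article">
    <title>An article</title>
    <link>http://example.org/article</link>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://del.icio.us/tag/rdf"/>
        <rdf:li rdf:resource="http://del.icio.us/tag/metadata"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
</rdf:RDF>
"""


def tag_statements(feed_xml: str) -> List[Tuple[str, str]]:
    """Return (bookmarked URI, tag URI) pairs found in an RSS 1.0 document."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.findall(f"{{{RSS}}}item"):
        subject = item.get(f"{{{RDF}}}about")
        for li in item.findall(f"{{{TAXO}}}topics/{{{RDF}}}Bag/{{{RDF}}}li"):
            pairs.append((subject, li.get(f"{{{RDF}}}resource")))
    return pairs


if __name__ == "__main__":
    for subject, tag in tag_statements(SAMPLE_FEED):
        print(subject, "is tagged", tag)
```

Nothing in tag_statements is specific to one site: any feed that follows the same RSS 1.0 conventions could be read the same way, although a full RDF parser would be the more robust choice.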
Technorati
Technorati shows you search results from Flickr, Furl and Del.icio.us
which have the tag you're looking for. The adaptation to each
site is done by a specific plugin.