Steven Johnson has an interesting post about some tools he uses to accelerate his writing process. In there, he mentions this:
So the proper unit for this kind of exploratory, semantic search is not the file, but rather something else, something I don't quite have a word for: a chunk or cluster of text, something close to those little quotes that I've assembled in DevonThink. If I have an eBook of Manuel DeLanda's on my hard drive, and I search for "urban ecosystem" I don't want the software to tell me that an entire book is related to my query. I want the software to tell me that these five separate paragraphs from this book are relevant. Until the tools can break out those smaller units on their own, I'll still be assembling my research library by hand in DevonThink.
I wonder whether it might be possible to have software create those smaller clippings on its own: you'd feed the program an entire e-book, and it would break it up into 200-1000 word chunks of text, based on word frequency and other cues (chapter or section breaks perhaps.)
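To make the idea of automatic clippings concrete, here is a minimal sketch in Python. It is only an illustration, not anything Steven Johnson or DevonThink actually does: it uses blank lines as the sole "cue" and ignores word frequency entirely. The function name `chunk_text` and its `min_words` and `max_words` parameters are my own assumptions, with defaults taken from the 200-1000 word range in the quote.

```python
import re


def chunk_text(text, min_words=200, max_words=1000):
    """Group paragraphs into chunks of roughly min_words..max_words words.

    Assumption: paragraphs are separated by blank lines. A single paragraph
    longer than max_words is kept whole, so it can exceed the upper bound.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the pending chunk if this paragraph would push it past the cap.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
        # Close the chunk as soon as it is long enough to stand on its own.
        if count >= min_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
    if current:
        # Whatever is left over becomes a final, possibly short, chunk.
        chunks.append("\n\n".join(current))
    return chunks
```

Calling `chunk_text(book_text)` on the full text of a book would return a list of strings, each made of whole paragraphs. A real tool would presumably add chapter and section breaks, and perhaps shifts in vocabulary, as further cues.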
Columbia University Press publishes the Columbia Guide to Online Style, and wanted to put together a guide to how to publish using digital tools and media (rather than what to publish). The objective for the Columbia Guide to Digital Publishing (CGDP) was to produce a book and a web site in tandem, with both sharing the same content.
I believe this idea sprang from a conversation between Laura Fillmore, President of Open Book Systems, and Bill Kasdorf, President of Impressions (since bought by Apex CoVantage). OBS has produced various content management systems for years, having grown out of the print publishing arena. (OBS published one of the first books discussing the Internet, in fact.) Impressions was also invested in online publishing, with significant research devoted to the use of XML as a platform for content sharing and reuse. The folks at Columbia University Press, with Stephen Sterns at the helm, bravely decided to give it a whirl.
So the objectives of the project were to:
To cut a long story short, it became clear that we needed a way to manage text at a much lower level than "chapter". To ensure users could save any span of text they wished, we needed a way to describe that span. So, between us, we came up with the idea of "chunks".
A "chunk" was defined as a single paragraph. Within that paragraph, we could reference other chunks to create cross-references. We could allow users to save chunks, or a span of chunks, to "My Guide". This met each of the requirements beautifully. Suddenly we could control the entire print publishing mark-up process online, and publish that same content in a dynamic fashion. Authors could update chunks rather than sections or chapters, and users could be notified of updates to specific sections of text.
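For readers who think in code, the chunk model can be sketched roughly as below. This is not the OBS implementation, which I am not reproducing here; the class names `Chunk` and `Guide`, the identifier format, and the method names are all hypothetical, chosen only to show how paragraph-level chunks, cross-references, saved spans, and update notifications could fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str                 # hypothetical stable id, e.g. "ch3-p12"
    text: str                     # one paragraph of content
    version: int = 1              # bumped each time an author edits the chunk
    xrefs: list = field(default_factory=list)  # ids of chunks this one cites


class Guide:
    """An ordered collection of chunks plus each user's saved spans."""

    def __init__(self, chunks):
        self.order = [c.chunk_id for c in chunks]
        self.by_id = {c.chunk_id: c for c in chunks}
        self.saved = {}           # user -> list of (first_id, last_id) spans

    def span(self, first_id, last_id):
        """Return the chunks from first_id through last_id, inclusive."""
        i, j = self.order.index(first_id), self.order.index(last_id)
        if i > j:
            raise ValueError("span must run forwards through the text")
        return [self.by_id[cid] for cid in self.order[i:j + 1]]

    def save_span(self, user, first_id, last_id):
        """Record a span in the user's personal collection ("My Guide")."""
        self.span(first_id, last_id)  # raises if the span is not valid
        self.saved.setdefault(user, []).append((first_id, last_id))

    def update_chunk(self, chunk_id, new_text):
        """Replace a chunk's text and return the users who saved it."""
        chunk = self.by_id[chunk_id]
        chunk.text = new_text
        chunk.version += 1
        pos = self.order.index(chunk_id)
        return [
            user for user, spans in self.saved.items()
            if any(self.order.index(a) <= pos <= self.order.index(b)
                   for a, b in spans)
        ]
```

In this sketch a saved span is stored as a pair of chunk ids rather than as copied text, so a reader who saved a span always sees the current version, and `update_chunk` can report exactly who should be notified.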
The print publishing process was also pretty much automated:
The one thing we did not do was provide metadata for each chunk of text (which is what Steven Johnson suggests would be useful for complete semantic searching of texts). We didn't do that because getting content from the authors was hard enough without asking them to describe every paragraph in their text. It could have been added to the online version over time, I suppose, but that was never done. Actually, it would be interesting to apply a wiki to text chunks, to allow individuals to collaboratively describe each chunk from their own point of view, and slowly build the complete metadata based on that.
Regardless, I think the project really did break some very interesting ground -- ground worth exploring further, as Steven suggests. No small congratulations should be given to Laura, Bill, and Stephen, and also to Mircea Baciu, the OBS developer who implemented these crazy ideas.
On the metadata issue, one of the interesting features of the "Guide" is the integration of the back-of-the-book concept index into the online text, which creates some rich possibilities.
Imagine if, while reading a book, you could "see" the index at the same time as the text. This "transparent index" feature could weight and organize the content in an important way. For example, if, while reading the online version of "Origin of Species" by Charles Darwin, you could see the words
"I have called this principle, by which
each slight variation, if useful, is preserved,
by the term Natural Selection."
highlighted as an index "main entry" and linked to the word "evolution" in the index, you would know that you should pay particular attention to that sentence. (In good indexes, there are only 4-5 page numbers following any main entry.) We readers of paper have become used to using indexes backwards!
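One way to picture a "transparent index" is as an ordinary index turned around: instead of looking up a term to find passages, you look up the passage you are reading to find its terms. The sketch below is only an illustration under that assumption; the function name `invert_index`, the chunk ids, and the sample entries are all made up and are not taken from the "Guide" itself.

```python
from collections import defaultdict


def invert_index(index):
    """Turn {term: [chunk ids]} into {chunk id: [terms]}."""
    by_chunk = defaultdict(list)
    for term, chunk_ids in index.items():
        for chunk_id in chunk_ids:
            by_chunk[chunk_id].append(term)
    return dict(by_chunk)


# Hypothetical main entries pointing at paragraph-level chunks.
index = {
    "evolution": ["ch4-p2"],
    "natural selection": ["ch4-p2", "ch4-p9"],
}
terms_for_chunk = invert_index(index)
```

With this lookup in hand, an online reader could highlight any paragraph that appears in `terms_for_chunk` and show its index terms alongside the text.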