February 01, 2005

Text chunking

Posted at February 1, 2005 04:06 PM in Technology.

Steven Johnson has an interesting post about some tools he uses to accelerate his writing process. In it, he writes:

So the proper unit for this kind of exploratory, semantic search is not the file, but rather something else, something I don't quite have a word for: a chunk or cluster of text, something close to those little quotes that I've assembled in DevonThink. If I have an eBook of Manuel DeLanda's on my hard drive, and I search for "urban ecosystem" I don't want the software to tell me that an entire book is related to my query. I want the software to tell me that these five separate paragraphs from this book are relevant. Until the tools can break out those smaller units on their own, I'll still be assembling my research library by hand in DevonThink.

I wonder whether it might be possible to have software create those smaller clippings on its own: you'd feed the program an entire e-book, and it would break it up into 200-1000 word chunks of text, based on word frequency and other cues (chapter or section breaks perhaps.)


This reminded me of how we went about producing the online version of the Columbia Guide to Digital Publishing, which used the "chunking" approach in 2002/2003.

Columbia University Press publishes the Columbia Guide to Online Style, and wanted to put together a guide to how to publish using digital tools and media (rather than what to publish). The objective for the CGDP was to produce a book and a web site in tandem, with the content for both being shared.

I believe this idea sprang from a conversation between Laura Fillmore, President of Open Book Systems, and Bill Kasdorf, President of Impressions (since bought by Apex CoVantage). OBS has produced various content management systems for years, having grown out of the print publishing arena. (OBS published one of the first books discussing the Internet, in fact.) Impressions was also invested in online publishing, with significant research devoted to the use of XML as a platform for content sharing and reuse. The folks at Columbia University Press, with Stephen Sterns at the helm, bravely decided to give it a whirl.

So the objectives of the project were to:

  1. Publish a book with timely, accurate information
  2. Publish a web site with timely, accurate information, and keep it updated over time
  3. Share the content between the media
  4. At some later date, publish a second edition of the book, using the content that will have been maintained and published on the web site
  5. Structure the content such that it could be reused on a micro basis, e.g. paragraph syndication
  6. Enable online users to save any pieces of text they wanted to a "My Guide" area

To cut a long story short, it became clear that we needed a way to manage text at a much lower level than "chapter". To ensure users could save any span of text they wished, we needed a way to describe that span. So, between us, we came up with the idea of "chunks".

A "chunk" was defined as a single paragraph. Within that paragraph, we could reference other chunks to create cross-references. We could allow users to save chunks, or a span of chunks, to "My Guide". This met each of the requirements beautifully. Suddenly we could control the entire print publishing mark-up process online, and publish that same content in a dynamic fashion. Authors could update chunks rather than sections or chapters, and users could be notified of updates to specific sections of text.
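To make the "span of chunks" idea concrete, here is a minimal sketch in Python. It is not the OBS implementation, which this post does not show; the chunk IDs, the `resolve_span` helper, and the `my_guide` list are all hypothetical names chosen for illustration. The assumption is that a chapter is an ordered list of chunk IDs and a saved span is just a pair of IDs.

```python
def resolve_span(sequence, start_id, end_id):
    """Return the chunk IDs from start_id through end_id, inclusive."""
    start = sequence.index(start_id)
    end = sequence.index(end_id)
    if start > end:
        start, end = end, start  # tolerate endpoints given in reverse order
    return sequence[start:end + 1]


# Hypothetical chapter: an ordered sequence of chunk IDs.
chapter_sequence = ["c10", "c11", "c12", "c13", "c14"]

# Hypothetical "My Guide": each saved item is a (first chunk, last chunk) pair.
# A single saved chunk uses the same ID twice.
my_guide = [("c11", "c13"), ("c14", "c14")]

saved = [resolve_span(chapter_sequence, a, b) for a, b in my_guide]
print(saved)
```

Because a saved span stores only IDs, an author can update the text of a chunk without invalidating anyone's saved items.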

The print publishing process was also pretty much automated:

  • Text was saved as chunks, each with a unique ID
  • The structure of the text was described by chunk ID sequences
  • Cross-references were targeted from and to specific text spans (areas within chunks)
  • The CMS exported the entire structure on a chapter-by-chapter basis in well-formed XML that complied with the print publisher's schema for their final typesetting system
  • The typesetting system imported the XML and applied styles to it automatically, and the layout was then tweaked using manual tools
  • The book was sent to print
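The first few steps in that list (chunks with IDs, structure as ID sequences, export as XML) can be sketched roughly as follows. This is an illustration only: the `Chunk` class, the `export_chapter` function, and the `chapter` and `chunk` element names are assumptions, since the publisher's actual schema is not described here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # unique ID (hypothetical naming scheme)
    text: str      # one paragraph of content


def export_chapter(title, chunk_ids, chunks):
    """Serialize one chapter as XML, emitting chunks in the given ID order."""
    chapter = ET.Element("chapter", {"title": title})
    for cid in chunk_ids:
        para = ET.SubElement(chapter, "chunk", {"id": cid})
        para.text = chunks[cid].text
    return ET.tostring(chapter, encoding="unicode")


# Chunks are stored by ID; the chapter structure is a separate ID sequence.
chunks = {
    "c1": Chunk("c1", "Markup describes structure."),
    "c2": Chunk("c2", "Styles are applied at typesetting time."),
}
xml_out = export_chapter("Markup", ["c2", "c1"], chunks)
print(xml_out)
```

Keeping the order in a separate sequence means a chapter can be rearranged, or a chunk reused elsewhere, without touching the stored text.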

The one thing we did not do was provide metadata for each chunk of text (which is what Steven Johnson suggests would be useful for complete semantic searching of texts). We didn't do that because getting content from the authors was hard enough without asking them to describe every paragraph in their text. It could have been added to the online version over time, I suppose, but that was never done. Actually, it would be interesting to apply a wiki to text chunks, to allow individuals to collaboratively describe each chunk from their own point of view, and slowly build the complete metadata based on that.
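As a thought experiment, the collaborative description idea could start out as simply as a tally of reader-supplied tags per chunk. The sketch below is hypothetical throughout (nothing like it was built for the Guide), and the names `chunk_tags`, `describe`, and `find_chunks` are invented for illustration.

```python
from collections import Counter, defaultdict

# chunk ID -> tally of descriptive tags contributed by readers
chunk_tags = defaultdict(Counter)


def describe(chunk_id, tags):
    """Record one contribution of tags for a chunk; repeats raise the count."""
    chunk_tags[chunk_id].update(t.lower() for t in tags)


def find_chunks(tag, min_votes=1):
    """Return sorted IDs of chunks tagged with `tag` at least min_votes times."""
    tag = tag.lower()
    return sorted(
        cid for cid, tally in chunk_tags.items() if tally[tag] >= min_votes
    )


describe("c11", ["XML", "markup"])
describe("c11", ["xml"])
describe("c12", ["typesetting", "XML"])

hits = find_chunks("xml", min_votes=2)
print(hits)
```

A search then returns individual chunks rather than whole chapters, which is the unit Steven is asking for.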

Regardless, I think the project really did break some very interesting ground -- ground worth exploring further, as Steven suggests. No small congratulations should be given to Laura, Bill, and Stephen, and also to Mircea Baciu, the OBS developer who implemented these crazy ideas.

Comments

On the metadata issue, one of the interesting features of the "Guide" involves the integration of the back of the book concept index into the online text, which creates some rich possibilities.

Imagine if, when reading a book, you could "see" the index at the same time as you read the text. This "transparent index" feature could weight and organize the content in an important way. For example, if, when reading the online version of "Origin of Species" by Charles Darwin, you could see the words

"I have called this principle, by which
each slight variation, if useful, is preserved,
by the term Natural Selection."

highlighted as an index "main entry," and the words were linked to the word "evolution" in the index, you would know that you should pay particular attention to that sentence. (In good indexes, there are only 4-5 page numbers following any main entry.) We readers of paper have become used to using indexes backwards!

Posted by laura fillmore at March 25, 2005 04:16 PM
