Wednesday, December 19, 2012

Data Semantics and Semantic Web - 2012 year-end reflections and prognosis


Growing role of background knowledge: Semantic Web researchers and early entrepreneurs knew (as exemplified by this first patent on Semantic Web technologies, filed in 2000) that, with moderate effort, it is possible to create background knowledge and populated ontologies by aggregating and disambiguating high-quality information and facts from multiple sources. It has also long been known that by using such knowledge bases, we can substantially improve information extraction and develop a variety of semantic tools and applications, including semantic search, browsing, personalization, and advertisement. Over the past 3-5 years, several efforts to create such knowledge bases took place, of which Freebase is a showcase. What has drawn everyone’s attention to this aspect of the semantic approach is Google's acquisition of the company that created Freebase, after which Google significantly extended largely known techniques and scaled them to the next level to create Google Knowledge Base (GKB). Further, applying GKB to enhance search (and, I am sure, other applications in the future) has forever changed the importance of creating and using background or domain models for semantic applications. I believe this form of semantic application building will see the fastest growth in the near future. I have discussed related thoughts in my article titled “Semantics Scales Up”.

Growing pains for Linked Open Data (LOD): The publication of over 300 large data sets with 30+ billion triples certainly draws the attention of many. Data holders will continue to find LOD an attractive vehicle to publish and share their data, so it will continue to grow at a rapid pace. Some of the data sets, more than others, will find additional usage for data reference, interlinking, and transformation. But in the near term, broader or aggregate usage of LOD will be a slog because we are running into some of the harder technical challenges: questionable data quality and provenance, unconstrained and uneven use of semantics (e.g., same-as used inconsistently), limited use of richer relationship types (e.g., part-of, causality), and poor interlinking (lack of high-quality alignment). We will need to get a better handle on these issues, along with a better ability to identify the most relevant and high-quality data sets (a semantic search for LOD) and better alignment tools (not limited to just same-as), before we can start realizing the true promise of LOD. So, I would give it another five years to fully develop.

Democratization of Semantics: So far, we have paid the majority of our attention to knowledge representation, languages, and reasoning. Furthermore, a majority of the work focuses on documents in enterprises and on the Web, or uses structured data transformed into triples. But what is even more exciting is how semantics and Semantic Web technology (primarily through annotating data with respect to background knowledge or ontologies) are being used to improve interoperability and analysis of different types of textual and non-textual data, especially social data and data generated by sensors, devices, or the Internet of Things. These types of data have long overtaken traditional document-centric data and structured databases in terms of volume, velocity, and variety. The type of semantics one needs to deal with for such (relatively) nontraditional data is of amazing variety. For example, in the Twitris system, besides semantic annotation for spatial, temporal, and thematic elements associated with the tweets, semantics (aka meaning) also includes understanding people (about the poster and receiver), network (about interactions and flow of messages), sentiment, emotion, and intent. For more in-depth treatment, see our just-published book on semantics-empowered Web 3.0. This is probably the most important development in my view and is likely to garner a much larger share of the attention paid to applications of semantics and Semantic Web technologies.




P.S.: Parts of this appear in Semantic Tech Outlook: 2013