Helping Data Scientists Navigate Big Data with the Semantic Web

The term “Big Data” is being thrown around a lot lately, but what is it, and what makes “Big Data” different from the other data that we work with in the healthcare industry? Big Data is differentiated by its volume, velocity, and variety. The last point, variety, covers not only the wide range of formats that data may reside in, but also the varying nomenclature and semantics of data assets.

Enter the Data Scientist

Big Data holds massive potential to uncover new insights and trends, and it has led to a new role: that of the Data Scientist. Data Scientists are (rare) people with statistical, data modeling, and programming backgrounds. They have to do more than crunch numbers: they need to use the data and visualize it to tell a story to executives in order to change strategy, identify new market opportunities, determine how to better maintain population health for an ACO, or what have you. As you can imagine, these people will be highly paid.

So now you’ve embarked on a Big Data program. Do you want these highly paid individuals spending their time trying to find and integrate data? While large enterprises may have petabytes of data, it is typically spread across numerous silos, very often each with its own nomenclature and semantics, which makes the data very difficult to integrate.

Wouldn’t it be useful for Data Scientists (and a lot of other folks!) to be able to find information regardless of how it is named, and to query the data without having to install a lot of database clients? Wouldn’t it help to tie it all together in a single query platform where you are automatically assured of globally unique identifiers and the computer draws as many inferences as possible, so that a lot of custom programming doesn’t have to be performed? If you think this would be extremely useful, Semantic Web technologies are for you!

Mining the Web of Data

The Semantic Web is a “web of data” instead of a web of documents (which is what the WWW is currently). Semantic Web technologies can of course be used on your intranet to traverse the web of your enterprise data to help bridge the semantic divide. The foundations of the Semantic Web are RDF and OWL. RDF triples give you a common logical model of subject, predicate, and object. Data can be converted to RDF triples ahead of time or on the fly, as with D2R, which converts RDBMS data into RDF triples at query time. Ontologies and business rules, expressed using RDFS, OWL, SWRL, etc., help to bridge the semantic divide and enable machine inferencing/learning. For example, the Translational Medicine Ontology (TMO) asserts that (for its purposes) the class “diseases” is equivalent to the “condition” and “side_effects” classes, so that an analysis of diseases would automatically (with inferencing) include the members of the condition and side effects classes:

<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/diseases> owl:equivalentClass <http://data.linkedct.org/resource/linkedct/condition>.

<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/diseases> owl:equivalentClass <http://www4.wiwiss.fu-berlin.de/sider/resource/sider/side_effects>.
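
The effect of these two statements can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how a reasoner such as Pellet or Jena works internally. Triples are modelled as tuples, the instance URIs under http://example.org/ are invented for the sketch, and equivalent_classes and instances_of are hypothetical helper names.

```python
RDF_TYPE = "rdf:type"
OWL_EQUIV = "owl:equivalentClass"

DISEASES = "http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/diseases"
CONDITION = "http://data.linkedct.org/resource/linkedct/condition"
SIDE_EFFECTS = "http://www4.wiwiss.fu-berlin.de/sider/resource/sider/side_effects"

# Each triple is (subject, predicate, object).
triples = {
    # The two equivalence statements from the text above.
    (DISEASES, OWL_EQUIV, CONDITION),
    (DISEASES, OWL_EQUIV, SIDE_EFFECTS),
    # Hypothetical instance data, invented for this sketch.
    ("http://example.org/d1", RDF_TYPE, DISEASES),
    ("http://example.org/c1", RDF_TYPE, CONDITION),
    ("http://example.org/s1", RDF_TYPE, SIDE_EFFECTS),
}


def equivalent_classes(cls, triples):
    """Return cls plus every class linked to it by owl:equivalentClass.

    Equivalence is symmetric and transitive, so links are followed
    in both directions until no new class is found.
    """
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != OWL_EQUIV:
                continue
            if s == current and o not in seen:
                seen.add(o)
                frontier.append(o)
            if o == current and s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen


def instances_of(cls, triples):
    """Return every subject typed as cls or as a class equivalent to cls."""
    classes = equivalent_classes(cls, triples)
    return {s for s, p, o in triples if p == RDF_TYPE and o in classes}


all_diseases = instances_of(DISEASES, triples)
```

Without the equivalence statements, a query for members of “diseases” would only find d1. With them, the members of the condition and side effects classes are included as well, and no custom integration code has to be written per source.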

Using Semantic Web technologies, Data Scientists can be more productive. With SPARQL (the query language for the Semantic Web), Data Scientists can formulate powerful federated queries informed by machine inferencing capabilities (using SPARQL 1.1 property paths or by accessing stored inferenced data generated by a reasoner such as Pellet or Jena). SPARQL can also pull the varied and disparate data together into a denormalized format (since many Big Data platforms don’t have join capabilities), which can then be distributed in Hadoop or another Big Data platform for high performance data mining if your Semantic Web platform isn’t able to achieve the performance required on massive amounts of data. The results of the data mining might be converted into RDF triples to enable a feedback loop. Semantic Web technology is beginning to scale dramatically: one triple store vendor has claimed the ability to store 1 trillion triples, so much more analysis will be able to be performed directly (and more easily) using your Semantic Web platform.
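
As a rough sketch of the denormalization and feedback steps, the plain Python below pivots triples into one flat row per subject, writes the rows as tab-separated text of the kind a Hadoop job could consume, and turns a mining result back into triples. It stands in for what a SPARQL SELECT over a real triple store would produce. The ex: identifiers, the sample data, the risk score, and the function names denormalize, to_tsv, and results_to_triples are all assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict


def denormalize(triples, columns):
    """Pivot (subject, predicate, object) triples into one flat row per subject."""
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)
    rows = []
    for s in sorted(by_subject):
        row = {"subject": s}
        for col in columns:
            # Multi-valued predicates are joined with "|" so that each
            # subject stays on a single line and no join is needed later.
            row[col] = "|".join(sorted(by_subject[s].get(col, [])))
        rows.append(row)
    return rows


def to_tsv(rows, columns):
    """Serialize the flat rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["subject"] + list(columns),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def results_to_triples(results, predicate):
    """Feedback loop: turn {subject: value} mining output back into triples."""
    return [(s, predicate, str(v)) for s, v in sorted(results.items())]


# Hypothetical sample data, invented for this sketch.
triples = [
    ("ex:patient1", "ex:diagnosis", "ex:asthma"),
    ("ex:patient1", "ex:drug", "ex:albuterol"),
    ("ex:patient2", "ex:diagnosis", "ex:diabetes"),
    ("ex:patient2", "ex:diagnosis", "ex:asthma"),
]
columns = ["ex:diagnosis", "ex:drug"]

rows = denormalize(triples, columns)
tsv = to_tsv(rows, columns)

# Pretend a downstream mining job produced a score per subject.
feedback = results_to_triples({"ex:patient2": 0.8}, "ex:riskScore")
```

Joining multi-valued predicates into a single delimited field is only one possible design. Emitting one row per value combination is the other common choice, at the cost of repeated data.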

Semantic Web technologies can be a very powerful tool in your Big Data arsenal: they help to address the semantic variety of data and help Data Scientists traverse the information silos.

by Pete Stiglich on April 30th, 2012