2

Hi, I have two resources that have the same rdfs:label, and I would like to know whether they refer to the same resource. One way to disambiguate is to check whether they share the same properties and values.

Thus, I have two questions:

  1. Is it possible to use graphs (Jena) to compare these two resources and evaluate whether they are the same? If so, how?
  2. Are there other methodologies to solve this issue?

Thanks in advance.

5 Answers

3

There is a Python library that is written for this kind of thing: Silk: A Linking Framework for the Web of Data.

You write a short specification in n3, where you define the matching properties, the similarity metrics, the weights for different properties, etc. and Silk will match the resources in two graphs, either locally or over SPARQL.
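
To give a feel for what "similarity metrics" and "weights for different properties" mean, here is a minimal sketch in plain Python. It is not Silk's API or its specification format; the property names, the weights, and the function names are all assumptions made up for illustration, and the string metric is simply the one from Python's difflib.

```python
from difflib import SequenceMatcher

# Hypothetical weights per property; a real link specification defines its own.
WEIGHTS = {"rdfs:label": 0.5, "foaf:homepage": 0.3, "dc:date": 0.2}


def string_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_similarity(res_a, res_b, weights=WEIGHTS):
    """Weighted average of per-property similarities.

    A property missing from either resource contributes a similarity of 0.
    """
    total = sum(weights.values())
    score = 0.0
    for prop, weight in weights.items():
        if prop in res_a and prop in res_b:
            score += weight * string_similarity(res_a[prop], res_b[prop])
    return score / total


# Illustrative resources, each reduced to one value per property.
res_a = {
    "rdfs:label": "Eugene O'Neill",
    "foaf:homepage": "http://example.org/oneill",
    "dc:date": "1888-10-16",
}
res_b = {
    "rdfs:label": "Eugene Gladstone O'Neill",
    "foaf:homepage": "http://example.org/oneill",
}

score = weighted_similarity(res_a, res_b)
print(f"similarity: {score:.2f}")
```

Pairs scoring above some threshold (which you would have to tune on your own data) would then be treated as candidate matches.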

That's exactly what is needed. Thanks! I will give it a try. – Ale Jun 18 at 7:07
2

One way to solve this using Jena would be to construct the sub-graph rooted in the resource, but make the subject a blank node. You could do this either through the API, or via a CONSTRUCT query:

construct { _:s ?p ?o } where {<http://your.resource/here> ?p ?o}

Note that this query does not compute the blank-node closure: if an object is itself a blank node, its own properties are not pulled in, so bear that in mind. Put each of these sub-graphs into its own Model, then use Model.isIsomorphicWith() to check whether the two graphs are equivalent. You need to change the subject resource to a blank node because otherwise the two subjects have different URIs, and isIsomorphicWith will immediately report that the models are not isomorphic.
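
The idea can be sketched without Jena in a few lines of plain Python. This is only an analogue of the approach, not Jena code: graphs are assumed to be sets of (subject, predicate, object) tuples, the example URIs are made up, and plain set equality stands in for the isomorphism test, which is only adequate when none of the objects are blank nodes.

```python
ANON = "_:s"  # placeholder standing in for the blank-node subject


def anonymised_description(triples, subject):
    """Return the triples about `subject`, with the subject replaced by ANON."""
    return {(ANON, p, o) for (s, p, o) in triples if s == subject}


def same_description(triples_a, subject_a, triples_b, subject_b):
    """True if both resources have exactly the same property/value pairs."""
    return (anonymised_description(triples_a, subject_a)
            == anonymised_description(triples_b, subject_b))


# Hypothetical example data.
A = "http://example.org/a"
B = "http://example.org/b"
C = "http://example.org/c"

graph_a = {(A, "rdfs:label", "Paris"), (A, "rdf:type", "ex:City")}
graph_b = {(B, "rdfs:label", "Paris"), (B, "rdf:type", "ex:City")}
graph_c = {(C, "rdfs:label", "Paris"), (C, "rdf:type", "ex:Person")}

print(same_description(graph_a, A, graph_b, B))
print(same_description(graph_a, A, graph_c, C))
```

Without the subject replacement, the two sets could never be equal, since every triple would still carry its original subject URI.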

The other question you may want to ask yourself is whether, for the purposes of your application, two resources R0 and R1 should count as the same if R1 has a strict subset of the properties of R0. If so, the above solution won't work, because isomorphism requires the two sub-graphs to match exactly.
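
If partial matches matter, one option is to compare the sets of property/value pairs directly instead of testing for isomorphism. The following is a rough sketch in plain Python, not Jena code; the tuple representation, the helper names, and the example data are assumptions, and the Jaccard overlap is just one possible score that I am swapping in here.

```python
def property_values(triples, subject):
    """Set of (predicate, object) pairs asserted about `subject`."""
    return {(p, o) for (s, p, o) in triples if s == subject}


def is_partial_match(triples, r0, r1):
    """True if one resource's pairs are a subset of the other's.

    Resources with no properties at all are never treated as a match.
    """
    d0 = property_values(triples, r0)
    d1 = property_values(triples, r1)
    return bool(d0) and bool(d1) and (d0 <= d1 or d1 <= d0)


def overlap(triples, r0, r1):
    """Jaccard overlap of the two descriptions, in [0, 1]."""
    d0 = property_values(triples, r0)
    d1 = property_values(triples, r1)
    union = d0 | d1
    return len(d0 & d1) / len(union) if union else 0.0


# Hypothetical example data.
triples = {
    ("ex:r0", "rdfs:label", "Paris"),
    ("ex:r0", "rdf:type", "ex:City"),
    ("ex:r0", "ex:country", "ex:France"),
    ("ex:r1", "rdfs:label", "Paris"),
    ("ex:r1", "rdf:type", "ex:City"),
    ("ex:r2", "rdfs:label", "Paris"),
    ("ex:r2", "rdf:type", "ex:Person"),
}

print(is_partial_match(triples, "ex:r0", "ex:r1"))
print(overlap(triples, "ex:r0", "ex:r2"))
```

Whether a subset should count as "the same" remains an application decision; a subset test is also brittle if the two datasets use different vocabularies for the same facts.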

Thanks for the answer! However, the problem is indeed to be able to compare bNodes as well. Still, I like the idea of creating a model with the subject as a blank node. – Ale Jun 17 at 12:47
2

An alternative/complementary (and interesting) approach would be to ask people: use crowdsourcing or games with a purpose (a.k.a. ESP games), either to provide feedback that improves your matching algorithm or to have humans do the job for you.

For example, see Freebase's RABJ:

"As Victor develops techniques to link the records in his dataset with Freebase he needs oversight to evaluate how his algorithms are doing. Humans excel at reconciliation i.e. understanding if two pieces of information describe the same concept. For instance, determining that the OpenLibrary page about Eugene O’Neill describes the same concept as the Freebase topic about a person who was born on 16 October 1888 in New York is a reconciliation task. Victor generates candidate pairings between the entities in his dataset and their likely Freebase counterparts. These pairings are submitted to the MatchMaker2 application (see Fig. 3) for evaluation. MatchMaker2 shows the target resource and the candidate topics most likely to relate to the same concept. Judges are asked to select one of the candidates (or indicate that no candidate was found appropriate). Victor uses the judgments to determine how well his classification strategy is doing."

-- http://wiki.freebase.com/wiki/File:Hcomp10-anatomy.pdf

1

Some ideas for part 2 of your question:

  • If both resources are claimed to be owl:sameAs, then they probably are the same
  • If both resources share the same value for an inverse functional property, such as foaf:homepage, foaf:mbox, or foaf:mbox_sha1sum, then they probably are the same
  • If the resources are claimed to be owl:differentFrom, then they probably are not the same
  • If one resource is of type A and the other of type B, and A owl:disjointWith B, then they probably are not the same

But such clear-cut answers can rarely be given. You really need specific rules and heuristics to determine whether two resources are the same entity. These rules and heuristics depend on your domain and your dataset, and without knowing what your data is about this can't be answered. There are very few generic rules and heuristics that would work on any kind of data about any domain.
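
The four generic rules listed above can be sketched as a small rule check. This is plain Python over a set of (subject, predicate, object) tuples, not a reasoner; the helper names and example data are made up, and checking the negative evidence before the positive evidence is my own assumption about how to order the rules.

```python
SAME, DIFFERENT, UNKNOWN = "same", "different", "unknown"

# Inverse functional properties named in the list above.
INVERSE_FUNCTIONAL = {"foaf:homepage", "foaf:mbox", "foaf:mbox_sha1sum"}


def holds(triples, s, p, o):
    """True if a symmetric statement is asserted in either direction."""
    return (s, p, o) in triples or (o, p, s) in triples


def values(triples, subject, predicate):
    """All objects asserted for the given subject and predicate."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def compare(triples, a, b):
    """Apply the generic rules; return SAME, DIFFERENT, or UNKNOWN."""
    if holds(triples, a, "owl:differentFrom", b):
        return DIFFERENT
    for type_a in values(triples, a, "rdf:type"):
        for type_b in values(triples, b, "rdf:type"):
            if holds(triples, type_a, "owl:disjointWith", type_b):
                return DIFFERENT
    if holds(triples, a, "owl:sameAs", b):
        return SAME
    for prop in INVERSE_FUNCTIONAL:
        if values(triples, a, prop) & values(triples, b, prop):
            return SAME
    return UNKNOWN


# Hypothetical example data.
triples = {
    ("ex:a", "foaf:mbox", "mailto:ale@example.org"),
    ("ex:b", "foaf:mbox", "mailto:ale@example.org"),
    ("ex:c", "rdf:type", "ex:Person"),
    ("ex:d", "rdf:type", "ex:Document"),
    ("ex:Person", "owl:disjointWith", "ex:Document"),
    ("ex:e", "rdfs:label", "Ale"),
}

print(compare(triples, "ex:a", "ex:b"))
print(compare(triples, "ex:c", "ex:d"))
print(compare(triples, "ex:a", "ex:e"))
```

Most pairs in real data will come back as "unknown", which is where the domain-specific rules and heuristics come in.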

Yes thanks a lot. I started to do comparisons using rdf:type and other domain properties. I just was wondering if this process could be "automatized". – Ale Jun 17 at 12:49
1

You can also take a machine learning approach, using either supervised or unsupervised algorithms. We've worked on FOAF instances using SVMs, described in:

Jennifer Sleeman and Tim Finin, A Machine Learning Approach to Linking FOAF Instances, Proceedings of the AAAI Spring Symposium on Linked Data Meets Artificial Intelligence, 22-24 March 2010, Stanford CA, AAAI Press.