Week 03 and 04

Alvaro Lopes edited this page Jul 18, 2023 · 3 revisions

TL;DR

During weeks 3 and 4, I wrapped up the Entity Linking step and started the Data Enriching step for Data Integration. I implemented:

  • The Steam and BookCrossing classes and their entity-linking methods, which match each dataset item with DBpedia;
  • The enrich() method of the MovieLens class, which enriches each mapped item with a chosen set of DBpedia properties. The properties analyzed and selected are listed in data_integration/metadata.md.

Steam matching result:

  • 7549 matches out of a total of 48988 items (15.41%)

Entity Linking Steam

The dataset used can be found in Steam-Kaggle.

The class Steam, derived from Dataset, implements the methods needed to convert item data to a standardized .csv file (to be done: converting user and rating data) and to match each game with its corresponding DBpedia URI.

For Steam, the most important field is the title. The baseline for this dataset is similar to the ones used for the other datasets and consists of:

  1. Match the extracted game title against the rdfs:label of DBpedia resources (and of pages that redirect to them) using a regex. See the SPARQL query below for more details.
  2. Use the type dbo:VideoGame to restrict the search to resources related to video games.
  3. When the query returns more than one URI, use the Levenshtein distance to choose the most similar one.
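Step 3 could be sketched as follows. This is a minimal illustration, not the project's actual code: the function names levenshtein and pick_best_uri are hypothetical, the comparison is assumed to be case-insensitive, and the candidate URIs are only sample input.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pick_best_uri(title: str, candidates: dict) -> str:
    """candidates maps URI -> label; return the URI whose label is closest to title.

    Assumption: labels and title are compared in lower case.
    """
    return min(
        candidates,
        key=lambda uri: levenshtein(title.lower(), candidates[uri].lower()),
    )


# Sample input only: two candidate URIs returned for the same title.
candidates = {
    "http://dbpedia.org/resource/Portal_(video_game)": "Portal",
    "http://dbpedia.org/resource/Portal_2": "Portal 2",
}
best = pick_best_uri("Portal 2", candidates)
```

Ties are resolved by dictionary order here; a real implementation might prefer an explicit tie-breaking rule.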

The query template is:

    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?game WHERE {
        {
            ?game rdf:type dbo:VideoGame .
            ?game rdfs:label ?label .
            FILTER regex(?label, "$name_regex", "i")
        }
        UNION
        {
            ?game rdf:type dbo:VideoGame .
            ?tmp dbo:wikiPageRedirects ?game .
            ?tmp rdfs:label ?label .
            FILTER regex(?label, "$name_regex", "i") .
        }
    }

The placeholder $name_regex is replaced with a regex built from the game title, which is matched case-insensitively (the "i" flag) against the label.
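A minimal sketch of how the placeholder could be filled, assuming the template is handled with Python's string.Template (the $name_regex syntax suggests this, but it is not confirmed here). The helpers title_to_regex, to_sparql_literal and build_query are hypothetical, the query is abbreviated to its first UNION branch, and anchoring the pattern with ^ and $ is an assumption rather than the project's actual rule.

```python
from string import Template

# Abbreviated version of the template above (first UNION branch only).
QUERY_TEMPLATE = Template("""\
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?game WHERE {
    ?game rdf:type dbo:VideoGame .
    ?game rdfs:label ?label .
    FILTER regex(?label, "$name_regex", "i")
}
""")

# Characters with a special meaning in the regex dialect used by SPARQL.
REGEX_META = set("\\.^$*+?()[]{}|")


def title_to_regex(title: str) -> str:
    """Escape regex metacharacters and anchor the pattern (assumed matching rule)."""
    escaped = "".join("\\" + ch if ch in REGEX_META else ch for ch in title.strip())
    return "^" + escaped + "$"


def to_sparql_literal(pattern: str) -> str:
    """Escape backslashes and double quotes so the pattern fits in a "..." literal."""
    return pattern.replace("\\", "\\\\").replace('"', '\\"')


def build_query(title: str) -> str:
    """Fill the $name_regex placeholder for one game title."""
    return QUERY_TEMPLATE.substitute(name_regex=to_sparql_literal(title_to_regex(title)))


query = build_query("Portal (2007)")
```

The pattern is escaped twice on purpose: once so that characters such as parentheses are treated literally by the regex engine, and once more so that the backslashes survive inside the SPARQL string literal.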

Enriching Datasets

The idea of this step is to enrich each recommender system (RS) dataset with useful DBpedia properties that provide more context and can improve RS performance.

To perform the data enriching, each Dataset subclass needs to implement the following method:

  • enrich(df_map): takes as argument a pd.DataFrame holding the mapping for the dataset and returns a pd.DataFrame containing each item_id and its retrieved properties.

Before performing the Data Enriching step, it is necessary to evaluate which DBpedia properties are useful for each RS domain. For now, properties are selected by the rate of mapped URIs that have them. The retrieved properties will be documented in data_integration/metadata.md.
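The selection criterion can be illustrated with a small sketch. The function names, the input shape (a dict from URI to the set of properties found for it) and the 0.5 threshold are all assumptions made for the example; the page does not state which cutoff was used.

```python
from collections import Counter


def property_coverage(properties_by_uri: dict) -> dict:
    """Return, for each property, the fraction of URIs that have it."""
    total = len(properties_by_uri)
    if total == 0:
        return {}
    counts = Counter()
    for props in properties_by_uri.values():
        counts.update(set(props))  # count each property once per URI
    return {prop: n / total for prop, n in counts.items()}


def select_properties(properties_by_uri: dict, threshold: float = 0.5) -> list:
    """Keep properties whose coverage reaches the (assumed) threshold."""
    coverage = property_coverage(properties_by_uri)
    return sorted(p for p, rate in coverage.items() if rate >= threshold)


# Made-up sample: four mapped URIs and the properties found for each.
sample = {
    "uri_1": {"dbo:director", "dbo:starring"},
    "uri_2": {"dbo:director"},
    "uri_3": {"dbo:director", "dbo:budget"},
    "uri_4": set(),
}
coverage = property_coverage(sample)
selected = select_properties(sample)
```

URIs with no properties at all still count in the denominator, so a property present on three of the four sample URIs has a rate of 0.75.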

Enriching MovieLens

For the MovieLens dataset, the most useful properties relate to the movie's director(s), writer(s), producer(s), subject(s), cast, and more. The query template is:

    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT
        ?abstract 
        (GROUP_CONCAT(DISTINCT ?producer; SEPARATOR="::") AS ?producer)
        (GROUP_CONCAT(DISTINCT ?distributor; SEPARATOR="::") AS ?distributor)
        (GROUP_CONCAT(DISTINCT ?writer; SEPARATOR="::") AS ?writer)
        (GROUP_CONCAT(DISTINCT ?cinematography; SEPARATOR="::") AS ?cinematography)
        (GROUP_CONCAT(DISTINCT ?subject; SEPARATOR="::") AS ?subject)
        (GROUP_CONCAT(DISTINCT ?starring; SEPARATOR="::") AS ?starring)
        (GROUP_CONCAT(DISTINCT ?director; SEPARATOR="::") AS ?director)
    WHERE {
        OPTIONAL { <$URI>   dct:subject         ?subject            }   .
        OPTIONAL { <$URI>   dbo:starring        ?starring           }   .
        OPTIONAL { <$URI>   dbo:director        ?director           }   .
        OPTIONAL {
            <$URI>   dbo:abstract        ?abstract .
            FILTER(LANG(?abstract) = 'en')
        }   .
        OPTIONAL { <$URI>   dbo:producer        ?producer           }   .
        OPTIONAL { <$URI>   dbo:distributor     ?distributor        }   .
        OPTIONAL { <$URI>   dbo:writer          ?writer             }   .
        OPTIONAL { <$URI>   dbo:cinematography  ?cinematography     }   .
    }
    GROUP BY ?abstract
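Because each multi-valued property comes back as a single string joined with "::", the result has to be split again before it can be stored per item. The sketch below assumes the endpoint returns the standard SPARQL JSON results format and works on one entry of its "bindings" list; parse_binding is a hypothetical helper and the sample values are made up.

```python
# Columns produced by GROUP_CONCAT in the query above.
MULTI_VALUED = (
    "producer", "distributor", "writer", "cinematography",
    "subject", "starring", "director",
)


def parse_binding(binding: dict, separator: str = "::") -> dict:
    """Turn one SPARQL JSON binding into a dict of plain Python values."""
    row = {"abstract": binding.get("abstract", {}).get("value")}
    for name in MULTI_VALUED:
        raw = binding.get(name, {}).get("value", "")
        # GROUP_CONCAT over an empty group yields "", which becomes [].
        row[name] = [value for value in raw.split(separator) if value]
    return row


# Made-up sample binding with placeholder URIs.
sample_binding = {
    "abstract": {"type": "literal", "xml:lang": "en", "value": "An example abstract."},
    "director": {"type": "literal", "value": "http://dbpedia.org/resource/Person_A"},
    "starring": {
        "type": "literal",
        "value": "http://dbpedia.org/resource/Person_B::http://dbpedia.org/resource/Person_C",
    },
    "writer": {"type": "literal", "value": ""},
}
row = parse_binding(sample_binding)
```

Properties missing from the binding and properties with an empty concatenation are both mapped to an empty list, which keeps the columns uniform when the rows are later collected into a DataFrame.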