00a_data_download_prep.ipynb
00b_movielens_eda.ipynb
00c_get_wikidata_side_features.ipynb
README.md
example_gen_features.md
vocab_dict.pkl

README.md

Download and prep MovieLens data

See 00a_data_download_prep.ipynb to download and prepare the MovieLens dataset:

python example_gen_movielens.py --project_id=$PROJECT_ID \
    --gcs_bucket_name=$BUCKET_NAME \
    --gcs_data_path_prefix=$EXAMPLE_GEN_GCS_PATH \
    --tfrecord_prefix=$TF_RECORD_PREFIX \
    --local_data_dir=$LOCAL_EXTRACT_DIR \
    --local_output_dir=$LOCAL_OUTPUT_DIR \
    --min_timeline_length=$MIN_TIMELINE_LENGTH \
    --max_context_length=$MAX_CONTEXT_LENGTH \
    --max_context_movie_genre_length=$MAX_GENRE_LENGTH \
    --min_rating=$MIN_RATING \
    --train_data_fraction=0.9 \
    --build_vocabs=$BUILD_VOCAB \
    --num_train_tfrecords=$NUM_TRAIN_RECORDS \
    --num_test_tfrecords=$NUM_VAL_RECORDS

WIP notebooks

00b_movielens_eda.ipynb
00c_get_wikidata_side_features.ipynb (enriches the MovieLens dataset with Wikidata side features)

Notes on source data and prepared features

raw MovieLens data

Each line of ratings.dat has the following format: UserID::MovieID::Rating::Timestamp

UserIDs range between 1 and 6040
MovieIDs range between 1 and 3952
Ratings are made on a 5-star scale (whole-star ratings only)
Timestamp is represented in seconds since the epoch as returned by time(2)
Each user has at least 20 ratings
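
As a minimal sketch of reading this format, the snippet below splits one line on the `::` delimiter and converts each field to an integer. The `Rating` tuple and the `parse_rating_line` helper are hypothetical names used only for illustration; they are not part of this repo, and the sample line is made up from values that appear in the example record later in this README.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int      # whole stars, 1-5
    timestamp: int   # seconds since the Unix epoch


def parse_rating_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Illustrative line, not copied from the actual file.
record = parse_rating_line("4876::1094::4::962765918\n")
print(record)
```

To load the whole file at once, `pandas.read_csv` with `sep="::"` and `engine="python"` is a common alternative, since the default C parser does not accept multi-character separators.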

features

Context features

If user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, users' preferences are highly contextual, adding context can improve the model significantly.

For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie during the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release, but decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
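
To make the day-of-week idea concrete, here is a small sketch that derives coarse time features from a rating timestamp. The `timestamp_context` helper and its output keys are assumptions for illustration, not features produced by this repo. It also interprets the timestamp in UTC, because the raw data does not record the user's local time zone.

```python
from datetime import datetime, timezone


def timestamp_context(ts_seconds: int) -> dict:
    """Derive coarse time-of-query features from a Unix timestamp (UTC)."""
    dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return {
        "day_of_week": dt.weekday(),   # Monday == 0 ... Sunday == 6
        "hour_of_day": dt.hour,
        "is_weekend": dt.weekday() >= 5,
    }


# Timestamp taken from the example record later in this README.
features = timestamp_context(962765918)
print(features)
```

Bucketed values like these are usually easier for a model to use than the raw integer timestamp, although the raw value can still be kept (for example, normalized) to capture popularity drift over time.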

Data sparsity

Using non-id features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle with estimating a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in cold-start situations, where relatively little data is available on some items or users.
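
One rough way to see where non-id features matter most is to count observations per ID and flag the ones with very few. The sketch below does this for a toy list of rated movie IDs. The `sparse_ids` helper, the threshold of 5, and the toy counts are all illustrative assumptions.

```python
from collections import Counter


def sparse_ids(interactions, min_count=5):
    """Return the set of IDs that occur fewer than `min_count` times."""
    counts = Counter(interactions)
    return {id_ for id_, n in counts.items() if n < min_count}


# Hypothetical toy data: one movie ID per observed rating.
rated_movie_ids = ["2858"] * 6 + ["527"] * 5 + ["1094"] * 2 + ["3418"]
cold_items = sparse_ids(rated_movie_ids, min_count=5)
print(sorted(cold_items))
```

Items flagged this way are the ones for which genre, title, or year features are most likely to carry the signal that a learned ID embedding cannot.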

our tf examples

A single tf.Example in our generated dataset should look similar to the following:

features {
  feature {
    key: "context_movie_genre"
    value {
      bytes_list {
        value: "Comedy"
        value: "Drama"
        value: "Drama"
        value: "War"
        value: "Drama"
        value: "Drama"
        value: "Thriller"
        value: "Drama"
        value: "Romance"
        value: "Children\'s"
      }
    }
  }
  feature {
    key: "context_movie_id"
    value {
      bytes_list {
        value: "2858"
        value: "527"
        value: "515"
        value: "593"
        value: "265"
        value: "34"
        value: "1704"
        value: "3418"
        value: "1179"
        value: "150"
      }
    }
  }
  feature {
    key: "context_movie_rating"
    value {
      float_list {
        value: 4.0
        value: 5.0
        value: 4.0
        value: 3.0
        value: 5.0
        value: 5.0
        value: 4.0
        value: 4.0
        value: 4.0
        value: 4.0
      }
    }
  }
  feature {
    key: "context_movie_title"
    value {
      bytes_list {
        value: "American Beauty (1999)"
        value: "Schindler\'s List (1993)"
        value: "Remains of the Day, The (1993)"
        value: "Silence of the Lambs, The (1991)"
        value: "Like Water for Chocolate (Como agua para chocolate) (1992)"
        value: "Babe (1995)"
        value: "Good Will Hunting (1997)"
        value: "Thelma & Louise (1991)"
        value: "Grifters, The (1990)"
        value: "Apollo 13 (1995)"
      }
    }
  }
  feature {
    key: "context_movie_year"
    value {
      int64_list {
        value: 1999
        value: 1993
        value: 1993
        value: 1991
        value: 1992
        value: 1995
        value: 1997
        value: 1991
        value: 1990
        value: 1995
      }
    }
  }
  feature {
    key: "context_rating_timestamp"
    value {
      int64_list {
        value: 962765672
        value: 962765704
        value: 962765731
        value: 962765760
        value: 962765816
        value: 962765816
        value: 962765845
        value: 962765918
        value: 962765918
        value: 962765918
      }
    }
  }
  feature {
    key: "target_movie_genres"
    value {
      bytes_list {
        value: "Drama"
        value: "Romance"
        value: "War"
      }
    }
  }
  feature {
    key: "target_movie_id"
    value {
      bytes_list {
        value: "1094"
      }
    }
  }
  feature {
    key: "target_movie_rating"
    value {
      float_list {
        value: 4.0
      }
    }
  }
  feature {
    key: "target_movie_title"
    value {
      bytes_list {
        value: "Crying Game, The (1992)"
      }
    }
  }
  feature {
    key: "target_movie_year"
    value {
      int64_list {
        value: 1992
      }
    }
  }
  feature {
    key: "target_rating_timestamp"
    value {
      int64_list {
        value: 962765918
      }
    }
  }
  feature {
    key: "user_age"
    value {
      int64_list {
        value: 35
      }
    }
  }
  feature {
    key: "user_gender"
    value {
      bytes_list {
        value: "F"
      }
    }
  }
  feature {
    key: "user_id"
    value {
      bytes_list {
        value: "4876"
      }
    }
  }
  feature {
    key: "user_occupation_text"
    value {
      bytes_list {
        value: "technician/engineer"
      }
    }
  }
  feature {
    key: "user_zip_code"
    value {
      bytes_list {
        value: "98201"
      }
    }
  }
}
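
The record above pairs ten time-ordered context movies with a single target movie. The generation script is not shown here, so the following is only a sketch of one common way to build such pairs: slide over a user's time-ordered ratings and pair each item with the items rated just before it. The `build_examples` function is hypothetical, and its parameters merely echo the `--max_context_length` and `--min_timeline_length` flags; the script's actual behavior may differ.

```python
def build_examples(timeline, max_context_length=10, min_timeline_length=3):
    """Turn one user's time-ordered movie IDs into (context, target) pairs.

    Assumption: each target is paired with up to `max_context_length`
    items rated immediately before it; short timelines are skipped.
    """
    if len(timeline) < min_timeline_length:
        return []
    examples = []
    for i in range(1, len(timeline)):
        start = max(0, i - max_context_length)
        examples.append((timeline[start:i], timeline[i]))
    return examples


# Movie IDs from the record above: ten context items, then the target.
timeline = ["2858", "527", "515", "593", "265", "34",
            "1704", "3418", "1179", "150", "1094"]
examples = build_examples(timeline)
context, target = examples[-1]
print(len(examples), len(context), target)
```

Under this assumption, the last pair reproduces the context and target movie IDs of the record shown above.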
