@email2liyang
Created November 19, 2017 09:56
Pig script to show the most popular five-star movies (average rating above 4.0) in the MovieLens 100k dataset
-- Load the tab-delimited ratings (tab is PigStorage's default delimiter).
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userId:int, movieId:int, rating:int, ratingTime:int);
-- Load the pipe-delimited movie metadata.
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieId:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);
-- Keep the title and convert the release date to a Unix timestamp.
nameLookup = FOREACH metadata GENERATE movieId, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;
-- Compute the average rating per movie.
ratingByMovie = GROUP ratings BY movieId;
avgRatings = FOREACH ratingByMovie GENERATE group AS movieId, AVG(ratings.rating) AS avgRating;
-- Keep only movies whose average rating is above 4.0.
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
-- Attach titles and sort by average rating, highest first.
fiveStarWithData = JOIN fiveStarMovies BY movieId, nameLookup BY movieId;
mostPopularFiveStarMovies = ORDER fiveStarWithData BY fiveStarMovies::avgRating DESC;
DUMP mostPopularFiveStarMovies;
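
The script above sorts by average rating, so a movie with a single 5-star rating can end up at the top. A possible variant, sketched below, also counts the ratings per movie and ranks by that count. The relation names (movieStats, wellRated, titled, ranked, top10), the minimum of 10 ratings, and the limit of 10 rows are illustrative assumptions and not part of the original gist.

```pig
-- Sketch only: same input files as the script above.
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userId:int, movieId:int, rating:int, ratingTime:int);
-- Only the first two columns of u.item are needed here.
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieId:int, movieTitle:chararray);

-- Per movie: average rating and number of ratings.
ratingByMovie = GROUP ratings BY movieId;
movieStats = FOREACH ratingByMovie GENERATE group AS movieId,
    AVG(ratings.rating) AS avgRating,
    COUNT(ratings.rating) AS numRatings;

-- Assumption: 10 is an arbitrary cutoff to exclude rarely rated movies.
wellRated = FILTER movieStats BY avgRating > 4.0 AND numRatings >= 10;

-- Attach titles and keep only the columns of interest.
withTitles = JOIN wellRated BY movieId, metadata BY movieId;
titled = FOREACH withTitles GENERATE metadata::movieTitle AS movieTitle,
    wellRated::avgRating AS avgRating,
    wellRated::numRatings AS numRatings;

-- Rank by popularity (number of ratings), most rated first.
ranked = ORDER titled BY numRatings DESC;
top10 = LIMIT ranked 10;
DUMP top10;
```

Ranking by numRatings rather than avgRating treats "popular" as "rated by many users"; raising or lowering the cutoff trades coverage against noise from sparsely rated titles.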