
Utilise Jupyter all-spark-notebook

  1. ICT707 Week 6 Lab Exercises

    Learning Objectives

In this lab you will learn how to:

  • Build a music recommender with Spark

  • Manipulate data into required layouts

  1. Suggested IDE and language

Utilise Jupyter all-spark-notebook on Docker. Please see the installation document for how to install and set up Docker and Jupyter (all-spark-notebook).

  1. Open a terminal and type:

sudo docker run -it --rm -v /home/grant/GrantPrivate/USC/spark:/home/spark -p 8888:8888 jupyter/all-spark-notebook

  2. The first time you connect, open a browser and paste in the URL printed in the terminal, which has the form http://localhost:8888/?token=683501a0baf6c7179fe4632f6aaf8049f4e120f04952e839

  3. In the top right-hand corner, click on ‘New’

  4. Then select ‘Python 3’ to give you the Python 3 work environment

  1. Saving and downloading the task

To download a file, select File > Download as, then choose Notebook (.ipynb).

  1. Week 6 Lab tasks

    1. Task 1: Music Recommender System using Apache Spark and Python

Description

For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix).

To create this system you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.

Datasets

You will be using some publicly available song data from audioscrobbler, namely:

  • artist_data_small.txt

  • artist_alias_small.txt

  • user_artist_data_small.txt

Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, “The Smiths”, “Smiths, The”, and “the smiths” may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist. So, the data set includes artist_alias.txt, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist. The artist_data.txt file then provides a map from the canonical artist ID to the name of the artist.

Loading data

Load the three datasets into RDDs and name them artistData, artistAlias, and userArtistData. Some of the files have tab delimiters while others have space delimiters. Make sure that your userArtistData RDD contains only the canonical artist IDs.
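The parsing and alias steps above could be sketched as follows. This is plain Python so it can be tried without a Spark context; the function names are hypothetical, and the line layouts are assumptions: user-artist lines are space-delimited (as in the Task 2 sample output), and alias lines are taken to be "variantID<TAB>canonicalID". Check your own files before relying on either.

```python
def parse_user_artist(line):
    """Parse 'userID artistID playCount' (assumed space-delimited) into ints."""
    user, artist, count = line.split()
    return int(user), int(artist), int(count)


def parse_alias(line):
    """Parse 'variantID<TAB>canonicalID' (assumed layout); None if malformed."""
    parts = line.split("\t")
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        # Covers empty or non-numeric fields.
        return None


def canonicalise(record, alias_map):
    """Swap a variant artist ID for its canonical ID when one is known."""
    user, artist, count = record
    return user, alias_map.get(artist, artist), count


# Small hand-made example (IDs other than the first line are made up).
alias_map = {10: 20}
records = [parse_user_artist("1059637 1000010 238"), (1, 10, 5)]
print([canonicalise(r, alias_map) for r in records])

# In the notebook the same functions could be handed to RDD operations,
# assuming `sc` is an existing SparkContext:
#   userArtistData = sc.textFile("user_artist_data_small.txt").map(parse_user_artist)
```

Returning None for malformed alias lines lets you drop them with a filter before building the lookup.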

Data Exploration

Write some code that will find each user's total play count. Find the three users with the highest total play counts (sum of all counters) and print the user ID, the total play count, and the mean play count (the average number of times the user played an artist). Your output should look as follows:

User 1059637 has a total play count of 674412 and a mean play count of 1878.5849582172702.

User 2064012 has a total play count of 548427 and a mean play count of 9455.637931034482.

User 2069337 has a total play count of 393515 and a mean play count of 1519.3629343629343.
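The aggregation behind this output could be sketched in plain Python as below. The function name and the tiny sample are hypothetical, and the mean is taken to be the total divided by the number of (user, artist) rows for that user. On an RDD, the same idea could be expressed by mapping each record to (user, (count, 1)) and combining with reduceByKey.

```python
from collections import defaultdict


def top_users_by_plays(records, n=3):
    """Return [(user, total_plays, mean_plays)] for the n users with most plays.

    records: iterable of (user, artist, play_count) tuples.
    """
    totals = defaultdict(int)   # user -> sum of play counts
    rows = defaultdict(int)     # user -> number of artist rows
    for user, _artist, count in records:
        totals[user] += count
        rows[user] += 1
    summary = [(user, totals[user], totals[user] / rows[user]) for user in totals]
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary[:n]


# Made-up sample data, not taken from the lab files.
sample = [(1, 100, 4), (1, 101, 6), (2, 100, 30), (3, 102, 1)]
for user, total, mean in top_users_by_plays(sample):
    print(f"User {user} has a total play count of {total} "
          f"and a mean play count of {mean}.")
```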

 

Splitting Data for Testing

Use the randomSplit function to divide the data (userArtistData) into:

  • A training set, trainData, that will be used to train the model. This set should constitute 40% of the data.

  • A validation set, validationData, used to perform parameter tuning. This set should constitute 40% of the data.

  • A test set, testData, used for a final evaluation of the model. This set should constitute 20% of the data.

Use a random seed value of 13. Since these datasets will be used repeatedly, you will probably want to persist them in memory using the cache function.
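A minimal sketch of the split is shown below. The helper name split_and_cache is hypothetical; it assumes the argument is a PySpark RDD, whose randomSplit(weights, seed) and cache() methods are the functions named above.

```python
def split_and_cache(rdd, weights=(0.4, 0.4, 0.2), seed=13):
    """Split an RDD into train/validation/test parts and cache each one.

    Assumes `rdd` is a PySpark RDD (randomSplit and cache are RDD methods).
    """
    train, validation, test = rdd.randomSplit(list(weights), seed)
    # cache() returns the RDD itself, so the results can be returned directly.
    return train.cache(), validation.cache(), test.cache()


# Usage in the notebook, assuming userArtistData has already been built:
#   trainData, validationData, testData = split_and_cache(userArtistData)
#   print(trainData.take(3))
#   print(trainData.count())
```

Keeping the weights and seed as defaults makes it easy to confirm that the same split is reproduced every time the cell is rerun.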

In addition, print out the first 3 elements of each set as well as their sizes; if you created these sets correctly, your output should look as follows:

[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]

[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]

[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]

19769

19690

10022

Model Evaluation

Although there may be several ways to evaluate a model, we will use a simple method here. Suppose we have a model and some dataset of true artist plays for a set of users. This model can be used to predict the top X artist recommendations for a user, and these recommendations can be compared with the artists that the user actually listened to (here, X is the number of artists that user has in the dataset of true artist plays). Then, the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can be calculated. This process can be repeated for all users and an average value returned.

For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user. Suppose that user actually listened to the artists [1,3,7,8]. Then, for this user, the model would have a score of 2/4=0.5. To get the overall score, this would be performed for all users, with the average returned.

NOTE: when using the model to predict the top-X artists for a user, do not include the artists listed with that user in the training data.

Name your function modelEval and have it take a model (the output of ALS.trainImplicit) and a dataset as input. For parameter tuning, the dataset parameter should be set to the validation data (validationData). After parameter tuning, the model can be evaluated on the test data (testData).
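The scoring rule described above can be sketched in plain Python before it is wired into modelEval. The helper names below are hypothetical, and the ranked prediction lists are assumed to have been obtained from the model already; this shows only the overlap arithmetic, not a complete modelEval.

```python
def overlap_score(ranked_predictions, actual_artists, seen_in_training=()):
    """Fraction of a user's true artists that appear in the top-X predictions.

    X is the number of distinct artists the user actually listened to.
    Artists already seen in training are dropped before taking the top X.
    """
    actual = set(actual_artists)
    if not actual:
        return 0.0
    seen = set(seen_in_training)
    candidates = [a for a in ranked_predictions if a not in seen]
    top_x = candidates[:len(actual)]
    return len(actual.intersection(top_x)) / len(actual)


def mean_overlap(predictions_by_user, actual_by_user, training_by_user=None):
    """Average overlap_score over every user in actual_by_user."""
    training_by_user = training_by_user or {}
    scores = [
        overlap_score(predictions_by_user.get(user, []),
                      actual,
                      training_by_user.get(user, ()))
        for user, actual in actual_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# The worked example from the text: two of four artists overlap.
print(overlap_score([1, 2, 4, 8], [1, 3, 7, 8]))  # 0.5
```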

Model Construction

Now we can build the best model possible using the validation set of data and the modelEval function. Although there are a few parameters we could optimize, for the sake of time we will just try a few different values for the rank parameter (leave everything else at its default value, except set seed=345). Loop through the values [2, 10, 20] and figure out which one produces the highest score based on your model evaluation function.

Note: this procedure may take several minutes to run.

For each rank value, print out the output of the modelEval function for that model. Your output should look as follows:

The model score for rank 2 is 0.08899463771264418

The model score for rank 10 is 0.08564036660709866

The model score for rank 20 is 0.0882907866257202
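The rank loop could be organised as in the sketch below. The helper pick_best_rank is hypothetical, and training and scoring are passed in as functions so the loop itself can be tried without Spark. In the notebook, train_fn might be lambda rank: ALS.trainImplicit(trainData, rank=rank, seed=345) and eval_fn might be lambda model: modelEval(model, validationData), assuming ALS has been imported from pyspark.mllib.recommendation.

```python
def pick_best_rank(train_fn, eval_fn, ranks=(2, 10, 20)):
    """Train one model per rank, print each score, and keep the best model.

    train_fn(rank) -> model; eval_fn(model) -> score (higher is better).
    Returns (best_rank, best_model, best_score).
    """
    best_rank, best_model, best_score = None, None, float("-inf")
    for rank in ranks:
        model = train_fn(rank)
        score = eval_fn(model)
        print(f"The model score for rank {rank} is {score}")
        if score > best_score:
            best_rank, best_model, best_score = rank, model, score
    return best_rank, best_model, best_score


# Stand-in functions with made-up scores, only to show the control flow.
fake_scores = {2: 0.3, 10: 0.1, 20: 0.2}
best_rank, bestModel, best_score = pick_best_rank(
    train_fn=lambda rank: rank,            # the "model" is just its rank here
    eval_fn=lambda model: fake_scores[model],
)
print(best_rank)
```

Returning the model along with its rank avoids retraining before the final check on the test data.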

 

Now, using the bestModel, we will check the results over the test data.

0.05355033022893128

 

Trying Some Artist Recommendations

Using the best model above, predict the top 5 artists for user 1059637 using the recommendProducts function. Map the results (integer IDs) to the real artist names using artistData, which maps canonical artist IDs to names. Print the results. The output should look as follows:

Artist 0: Something Corporate

Artist 1: My Chemical Romance

Artist 2: Green Day

Artist 3: Taking Back Sunday

Artist 4: The Used
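Turning recommendations into names could look like the sketch below. The Rating namedtuple is a stand-in with the same fields (user, product, rating) as the objects shown in this lab's output; the helper name, the IDs, and the names in the sample are placeholders. In the notebook the recommendations would come from something like bestModel.recommendProducts(1059637, 5), and the dictionary from your artist ID to name data.

```python
from collections import namedtuple

# Stand-in with the same field names as the Rating objects printed in this lab.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def describe_recommendations(recommendations, artist_names):
    """Return one 'Artist i: name' line per recommendation, in rank order."""
    lines = []
    for position, rec in enumerate(recommendations):
        # Fall back to a visible marker if an ID is missing from the lookup.
        name = artist_names.get(rec.product, f"<unknown artist {rec.product}>")
        lines.append(f"Artist {position}: {name}")
    return lines


# Placeholder data for illustration only.
recs = [Rating(1, 101, 0.9), Rating(1, 102, 0.8), Rating(1, 999, 0.7)]
names = {101: "Example Artist A", 102: "Example Artist B"}
for line in describe_recommendations(recs, names):
    print(line)
```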

 

    2. Task 2: Spark and Artist data

  1. Load the file user_artist_data_small.txt

  2. View the first 10 entries

['1059637 1000010 238',
'1059637 1000049 1',
'1059637 1000056 1',
'1059637 1000062 11',
'1059637 1000094 1',
'1059637 1000112 423',
'1059637 1000113 5',
'1059637 1000114 2',
'1059637 1000123 2',
'1059637 1000130 19129']

 

  3. Display the statistics on the user artist data

(count: 49481, mean: 130.5757967704775, stdev: 3034.35409229, max: 439771.0, min: 1.0)
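The summary above has the shape printed by the RDD stats() method. What those five numbers mean can be sketched in plain Python as follows; the helper name is hypothetical, and the population standard deviation is assumed here, so compare against your own stats() output rather than relying on it.

```python
import statistics


def summarise(counts):
    """Return count, mean, population stdev, max and min of the play counts."""
    values = [float(c) for c in counts]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),  # population stdev (assumption)
        "max": max(values),
        "min": min(values),
    }


# Made-up play counts for illustration only.
print(summarise([1, 2, 3]))
```

A standard deviation far larger than the mean, as in the lab output, suggests that a few very heavy listeners dominate the play counts.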

 

  4. Format the user artist data into: [Rating(user=x, product=y, rating=z)]

[Rating(user=1059637, product=1000010, rating=238.0),

Rating(user=1059637, product=1000112, rating=423.0),

Rating(user=1059637, product=1000130, rating=19129.0),

Rating(user=1059637, product=1000241, rating=188.0),

Rating(user=1059637, product=1000263, rating=180.0),

Rating(user=1059637, product=1000320, rating=21.0),

Rating(user=1059637, product=1000427, rating=20.0),

Rating(user=1059637, product=1000445, rating=88.0),

Rating(user=1059637, product=1000632, rating=250.0),

Rating(user=1059637, product=1000999, rating=22.0)]
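One way to produce records in this shape is sketched below. The Rating namedtuple is a stand-in with the same fields as the objects printed above (in the notebook you would use the Rating class from pyspark.mllib.recommendation). The sample output appears to omit rows with small play counts, but no threshold is stated, so the min_plays parameter is hypothetical.

```python
from collections import namedtuple

# Stand-in for the Rating class used in the lab output.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def to_rating(line):
    """Convert 'userID artistID playCount' into a Rating with a float rating."""
    user, artist, count = line.split()
    return Rating(int(user), int(artist), float(count))


def to_ratings(lines, min_plays=0):
    """Convert many lines, keeping only ratings of at least min_plays."""
    ratings = (to_rating(line) for line in lines)
    return [r for r in ratings if r.rating >= min_plays]


print(to_rating("1059637 1000010 238"))
# Rating(user=1059637, product=1000010, rating=238.0)
```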

 

  5. Generate recommendations

[Rating(user=1059637, product=4267, rating=1.0737777242257698),

Rating(user=1059637, product=1006123, rating=1.063707083597756),

Rating(user=1059637, product=1004296, rating=1.0146439932352593),

Rating(user=1059637, product=1002128, rating=1.002596414515681),

Rating(user=1059637, product=1002095, rating=1.0003925092689543)]

 

  6. Load the artist_data_small.txt file and create a lookup for artists

['Aerosmith']
['Alkaline Trio']
['Bright Eyes']
['Jason Mraz']
['Jimmy Eat World']
['Hoobastank']
['Goo Goo Dolls']
['Modest Mouse']
['Something Corporate']
['Taking Back Sunday']
['The Movielife']
['Good Charlotte']
['Nena']
['Billy Joel']
['Dashboard Confessional']
['hellogoodbye']
['Clint Mansell']
['Brand New']
['Thursday']
['The Von Bondies']
['Motion City Soundtrack']
['Cursive']
['Onelinedrawing']
['A Static Lullaby']
['Coheed and Cambria']
['Further Seems Forever']
['Hey Mercedes']
['Hopesfall']
['Underoath']
['Hot Hot Heat']
['Frou Frou']
['Hanson']
['Mae']
['Against Me!']
['Remy Zero']
['Colin Hay']
['Senses Fail']
['Klaus Badelt']
['Boys Night Out']
['Dane Cook']
['Fall Out Boy']
['My Chemical Romance']
['Story of the Year']
['Blaque']
['Say Anything']
['Jonathan Larson']
['A Trunk Full Of Dead Bodies']
['Emocapella']
['Hidden in Plain View']
['Ryan Cabrera']
['Coldplay']
['Electric Light Orchestra']
['Elliott Smith']
['Head Automatica']
['U2']
['Ima Robot']
['Beastie Boys']
['The Early November']
['The Postal Service']
['The Rocket Summer']
['The Decemberists']
['The Shins']
['Straylight Run']
['The Format']
['Ashlee Simpson']
['Iron & Wine']
['Perfect Endings']
['Oasis']
['Nightmare Of You']
['The Killers']
['Cary Brothers']
["I Can Make a Mess Like Nobody's Business"]
['Zero 7']
['Hawthorne Heights']
['Thievery Corporation']
['Summit Drive']
['Matthew Walker']
['Nick Drake']
['Death Cab for Cutie']
['Lou Reed']
['Green Day']
['Simon & Garfunkel']
['The Smiths']
['Less Than Jake']
['Bonnie Somerville']
['Domestic Disturbance']
['The NSG']

 

  7. Link up the recommendations and the artist lookup

['Green Day']
['Cursive']
['Thrice']
['Taking Back Sunday']
['Something Corporate']

 
