
Utilise Jupyter all-spark-notebook

  1. ICT707 Week 6 Lab Exercises

    Learning Objectives

In this lab you will learn how to:

  • Build a music recommender with Spark

  • Manipulate data into required layouts

  1. Suggested IDE and language

Utilise Jupyter all-spark-notebook on Docker. Please see the installation document for how to install and set up Docker and Jupyter (all-spark-notebook).

  1. Open a terminal and type:

sudo docker run -it --rm -v /home/grant/GrantPrivate/USC/spark:/home/spark -p 8888:8888 jupyter/all-spark-notebook

  1. Open a browser and, the first time you connect, paste in the URL printed in the terminal. It will look like this, with a different token: http://localhost:8888/?token=683501a0baf6c7179fe4632f6aaf8049f4e120f04952e839

  1. In the top right hand corner click on ‘New’

  2. Then select ‘Python 3’ to give you the Python 3 work environment

  1. Saving and downloading the task

To download a file just select File>Download as

Download the file as a Notebook (.ipynb)

  1. Week 6 Lab tasks

    1. Task 1: Music Recommender System using Apache Spark and Python

Description

For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix).

To create this system you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.

Datasets

You will be using some publicly available song data from audioscrobbler, namely:

  • artist_data_small.txt

  • artist_alias_small.txt

  • user_artist_data_small.txt

Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, “The Smiths”, “Smiths, The”, and “the smiths” may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist. So, the data set includes artist_alias.txt, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist. The artist_data.txt file then provides a map from the canonical artist ID to the name of the artist.

Loading data

Load the three datasets into RDDs and name them artistData, artistAlias, and userArtistData. Some of the files have tab delimiters while others have space delimiters. Make sure that your userArtistData RDD contains only the canonical artist IDs.
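A minimal sketch of the loading step. It assumes an existing SparkContext passed in as `sc`, a directory `data_dir` holding the three files, a space delimiter in user_artist_data_small.txt, and a tab delimiter in the other two files (check the files and swap the delimiters if this is wrong). All helper names are hypothetical.

```python
def parse_user_artist(line):
    """'user artist count' (assumed space-delimited) -> (int, int, int)."""
    user, artist, count = line.split()
    return int(user), int(artist), int(count)


def parse_id_pair(line, value_type):
    """'id<TAB>value' (assumed tab-delimited) -> (int, value), or None if malformed."""
    parts = line.split("\t")
    try:
        return int(parts[0]), value_type(parts[1].strip())
    except (IndexError, ValueError):
        return None


def canonicalise(record, alias_map):
    """Replace a known alias artist ID with its canonical ID."""
    user, artist, count = record
    return user, alias_map.get(artist, artist), count


def load_datasets(sc, data_dir):
    """Return (artistData, artistAlias, userArtistData) as RDDs."""
    artistData = (sc.textFile(data_dir + "/artist_data_small.txt")
                  .map(lambda line: parse_id_pair(line, str))
                  .filter(lambda pair: pair is not None))
    artistAlias = (sc.textFile(data_dir + "/artist_alias_small.txt")
                   .map(lambda line: parse_id_pair(line, int))
                   .filter(lambda pair: pair is not None))
    # The alias table is small, so ship it to the workers as a broadcast dict.
    alias_map = sc.broadcast(artistAlias.collectAsMap())
    userArtistData = (sc.textFile(data_dir + "/user_artist_data_small.txt")
                      .map(parse_user_artist)
                      .map(lambda rec: canonicalise(rec, alias_map.value)))
    return artistData, artistAlias, userArtistData
```

In the notebook this would be called as `artistData, artistAlias, userArtistData = load_datasets(sc, "/home/spark")`, where the path is only an example.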

Data Exploration

Write some code that will find the users’ total play counts. Find the three users with the highest total play counts (sum of all counters) and print the user ID, the total play count, and the mean play count (average number of times a user played an artist). Your output should look as follows:

User 1059637 has a total play count of 674412 and a mean play count of 1878.5849582172702.

User 2064012 has a total play count of 548427 and a mean play count of 9455.637931034482.

User 2069337 has a total play count of 393515 and a mean play count of 1519.3629343629343.
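One possible way to compute these figures is sketched below. It assumes `userArtistData` is an RDD of `(user, artist, count)` tuples; the function names are hypothetical. `aggregateByKey` builds a `(total, number_of_artists)` pair per user in one pass, and `takeOrdered` picks the largest totals.

```python
def add_play(acc, count):
    """Fold one play count into a (total, n_artists) accumulator."""
    return acc[0] + count, acc[1] + 1


def merge_acc(a, b):
    """Combine two partial (total, n_artists) accumulators."""
    return a[0] + b[0], a[1] + b[1]


def top_users(userArtistData, n=3):
    """Return [(user, total, mean)] for the n users with the highest totals."""
    totals = (userArtistData
              .map(lambda rec: (rec[0], rec[2]))
              .aggregateByKey((0, 0), add_play, merge_acc))
    top = totals.takeOrdered(n, key=lambda kv: -kv[1][0])
    return [(user, total, total / n_artists)
            for user, (total, n_artists) in top]


def format_user(user, total, mean):
    """Render one line in the layout of the expected output."""
    return ("User %d has a total play count of %d and a mean play count of %s."
            % (user, total, mean))


# In the notebook (not run here):
# for row in top_users(userArtistData):
#     print(format_user(*row))
```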

 

Splitting Data for Testing

Use the randomSplit function to divide the data (userArtistData) into:

  • A training set, trainData, that will be used to train the model. This set should constitute 40% of the data.

  • A validation set, validationData, used to perform parameter tuning. This set should constitute 40% of the data.

  • A test set, testData, used for a final evaluation of the model. This set should constitute 20% of the data.

Use a random seed value of 13. Since these datasets will be repeatedly used you will probably want to persist them in memory using the cache function.
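The split described above can be sketched as follows, assuming `userArtistData` is an RDD. The function name `split_data` is hypothetical; `randomSplit` and `cache` are the RDD methods named in the text.

```python
SPLIT_WEIGHTS = [0.4, 0.4, 0.2]  # train, validation, test


def split_data(userArtistData, seed=13):
    """Split into (trainData, validationData, testData) and cache each part."""
    trainData, validationData, testData = userArtistData.randomSplit(
        SPLIT_WEIGHTS, seed)
    for part in (trainData, validationData, testData):
        part.cache()  # keep the split in memory for repeated use
    return trainData, validationData, testData


# In the notebook (not run here):
# trainData, validationData, testData = split_data(userArtistData)
# for part in (trainData, validationData, testData):
#     print(part.take(3))
# for part in (trainData, validationData, testData):
#     print(part.count())
```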

In addition, print out the first 3 elements of each set as well as their sizes; if you created these sets correctly, your output should look as follows:

[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]

[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]

[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]

19769

19690

10022

Model Evaluation

Although there may be several ways to evaluate a model, we will use a simple method here. Suppose we have a model and some dataset of true artist plays for a set of users. This model can be used to predict the top X artist recommendations for a user, and these recommendations can be compared with the artists that the user actually listened to (here, X will be the number of artists in the dataset of true artist plays). Then, the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can be calculated. This process can be repeated for all users and an average value returned.

For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user. Suppose, that user actually listened to the artists [1,3,7,8]. Then, for this user, the model would have a score of 2/4=0.5. To get the overall score, this would be performed for all users, with the average returned.
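The arithmetic in this example can be checked with a few lines of plain Python. The function names are hypothetical and no Spark is involved.

```python
def overlap_score(predicted, actual):
    """Fraction of the user's actual artists that appear in the predictions."""
    actual = set(actual)
    if not actual:
        raise ValueError("the user has no artists to compare against")
    return len(set(predicted) & actual) / len(actual)


def mean_score(per_user):
    """Average overlap_score over an iterable of (predicted, actual) pairs."""
    scores = [overlap_score(pred, act) for pred, act in per_user]
    return sum(scores) / len(scores)


# The worked example from the text: artists 1 and 8 overlap, so 2/4.
print(overlap_score([1, 2, 4, 8], [1, 3, 7, 8]))  # 0.5
```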

NOTE: when using the model to predict the top-X artists for a user, do not include the artists listed with that user in the training data.

Name your function modelEval and have it take a model (the output of ALS.trainImplicit) and a dataset as input. For parameter tuning, the dataset parameter should be set to the validation data (validationData). After parameter tuning, the model can be evaluated on the test data (testData).
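A rough sketch of such a function is shown below; it is one straightforward approach, not necessarily the intended solution. The text asks for a function of a model and a dataset only. The extra parameters `trainData`, `allArtists` (a list of every artist ID) and `sc` are assumptions made to keep the sketch self-contained; in the notebook they could be globals instead. It relies on `predictAll`, which the MLlib matrix factorisation model provides, and runs one Spark job per user, so it is simple rather than fast.

```python
def top_unseen(scored, seen, x):
    """Return the x highest-scored artist IDs that are not in `seen`."""
    unseen = [(artist, score) for artist, score in scored if artist not in seen]
    unseen.sort(key=lambda pair: pair[1], reverse=True)
    return [artist for artist, _ in unseen[:x]]


def artists_by_user(data):
    """Collect an RDD of (user, artist, count) into {user: set(artists)}."""
    return (data.map(lambda rec: (rec[0], rec[1]))
            .groupByKey()
            .mapValues(set)
            .collectAsMap())


def modelEval(model, dataset, trainData, allArtists, sc):
    """Average fraction of each user's true artists found in the top X."""
    train_by_user = artists_by_user(trainData)
    true_by_user = artists_by_user(dataset)
    scores = []
    for user, true_artists in true_by_user.items():
        seen = train_by_user.get(user, set())
        pairs = sc.parallelize([(user, artist) for artist in allArtists])
        scored = (model.predictAll(pairs)
                  .map(lambda p: (p.product, p.rating))
                  .collect())
        # Exclude artists from the training data, as the NOTE above requires.
        top_x = top_unseen(scored, seen, len(true_artists))
        scores.append(len(set(top_x) & true_artists) / len(true_artists))
    return sum(scores) / len(scores)
```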

Model Construction

Now we can build the best model possible using the validation set of data and the modelEval function. Although there are a few parameters we could optimize, for the sake of time we will just try a few different values for the rank parameter (leave everything else at its default value, except make seed=345). Loop through the values [2, 10, 20] and figure out which one produces the highest score based on your model evaluation function.

Note: this procedure may take several minutes to run.

For each rank value, print out the output of the modelEval function for that model. Your output should look as follows:

The model score for rank 2 is 0.08899463771264418

The model score for rank 10 is 0.08564036660709866

The model score for rank 20 is 0.0882907866257202
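The search over rank values can be sketched as a small loop. To keep the sketch runnable without Spark, the training and scoring steps are passed in as callables; `pick_best_rank` and its parameters are hypothetical names.

```python
def pick_best_rank(ranks, train, evaluate):
    """Train one model per rank, print its score, return the best.

    train(rank) must return a model and evaluate(model) a numeric score.
    Returns a (rank, score, model) tuple for the highest score.
    """
    best = None
    for rank in ranks:
        model = train(rank)
        score = evaluate(model)
        print("The model score for rank %d is %s" % (rank, score))
        if best is None or score > best[1]:
            best = (rank, score, model)
    return best


# In the notebook (not run here), with trainData, validationData and a
# two-argument modelEval already defined:
# from pyspark.mllib.recommendation import ALS
# bestRank, bestScore, bestModel = pick_best_rank(
#     [2, 10, 20],
#     train=lambda r: ALS.trainImplicit(trainData, rank=r, seed=345),
#     evaluate=lambda m: modelEval(m, validationData))
```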

 

Now, using the bestModel, we will check the results over the test data.

0.05355033022893128

 

Trying Some Artist Recommendations

Using the best model above, predict the top 5 artists for user 1059637 using the recommendProducts function. Map the results (integer IDs) to the real artist names using artistData, which maps canonical artist IDs to names. Print the results. The output should look as follows:

Artist 0: Something Corporate

Artist 1: My Chemical Romance

Artist 2: Green Day

Artist 3: Taking Back Sunday

Artist 4: The Used
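A small sketch of the printing step, assuming the recommendations are the `Rating` objects returned by `recommendProducts` (each has a `product` attribute) and that `artist_names` is a plain dict from canonical artist ID to name. Both `format_recommendations` and `artist_names` are hypothetical names.

```python
def format_recommendations(recommendations, artist_names):
    """Return one 'Artist i: name' line per recommendation."""
    lines = []
    for i, rec in enumerate(recommendations):
        # Fall back to the raw ID when the name is missing from the lookup.
        name = artist_names.get(rec.product, "unknown artist %d" % rec.product)
        lines.append("Artist %d: %s" % (i, name))
    return lines


# In the notebook (not run here), assuming artistData is an RDD of
# (artist_id, name) pairs:
# recs = bestModel.recommendProducts(1059637, 5)
# artist_names = artistData.collectAsMap()
# print("\n".join(format_recommendations(recs, artist_names)))
```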

 

    1. Task 2: Spark and Artist data

  1. Load the file user_artist_data_small.txt

  2. View the first 10 entries

['1059637 1000010 238',

'1059637 1000049 1',

'1059637 1000056 1',

'1059637 1000062 11',

'1059637 1000094 1',

'1059637 1000112 423',

'1059637 1000113 5',

'1059637 1000114 2',

'1059637 1000123 2',

'1059637 1000130 19129']

 

  3. Display the statistics on the user artist data

(count: 49481, mean: 130.5757967704775, stdev: 3034.35409229, max: 439771.0, min: 1.0)
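Steps 1 to 3 might look like the sketch below. It assumes a SparkContext `sc`, a file path that matches where the data was mounted, and that the statistics are taken over the play-count column (the third field), which is consistent with the minimum of 1.0 shown above. The function names are hypothetical; `take` and `stats` are standard RDD methods.

```python
def play_count(line):
    """Extract the play count (third space-separated field) as a float."""
    return float(line.split()[2])


def explore(sc, path="user_artist_data_small.txt"):
    """Return the first 10 raw lines and summary statistics of the counts."""
    lines = sc.textFile(path)
    first_ten = lines.take(10)
    stats = lines.map(play_count).stats()  # count, mean, stdev, max, min
    return first_ten, stats


# In the notebook (not run here):
# first_ten, stats = explore(sc)
# print(first_ten)
# print(stats)
```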

 

  4. Format the user artist data into: [Rating(user=x, product=y, rating=z)]

[Rating(user=1059637, product=1000010, rating=238.0),

Rating(user=1059637, product=1000112, rating=423.0),

Rating(user=1059637, product=1000130, rating=19129.0),

Rating(user=1059637, product=1000241, rating=188.0),

Rating(user=1059637, product=1000263, rating=180.0),

Rating(user=1059637, product=1000320, rating=21.0),

Rating(user=1059637, product=1000427, rating=20.0),

Rating(user=1059637, product=1000445, rating=88.0),

Rating(user=1059637, product=1000632, rating=250.0),

Rating(user=1059637, product=1000999, rating=22.0)]
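A sketch of the conversion, assuming `lines` is the RDD of raw text lines from step 1. The sample output above contains no play counts below 20, which suggests low counts may have been filtered out; that is only an inference, so `min_plays` is a hypothetical parameter and defaults to no filtering.

```python
def rating_fields(line):
    """'user artist count' -> (int user, int artist, float count)."""
    user, artist, count = line.split()
    return int(user), int(artist), float(count)


def to_ratings(lines, min_plays=0.0):
    """Map raw lines to MLlib Rating objects, optionally dropping low counts."""
    # Imported here so rating_fields can be used without Spark installed.
    from pyspark.mllib.recommendation import Rating
    return (lines.map(rating_fields)
            .filter(lambda fields: fields[2] >= min_plays)
            .map(lambda fields: Rating(*fields)))


# In the notebook (not run here):
# ratings = to_ratings(sc.textFile("user_artist_data_small.txt"))
# print(ratings.take(10))
```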

 

  5. Generate recommendations

[Rating(user=1059637, product=4267, rating=1.0737777242257698),

Rating(user=1059637, product=1006123, rating=1.063707083597756),

Rating(user=1059637, product=1004296, rating=1.0146439932352593),

Rating(user=1059637, product=1002128, rating=1.002596414515681),

Rating(user=1059637, product=1002095, rating=1.0003925092689543)]
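The handout does not say how these recommendations were produced. The sketch below assumes an implicit-feedback ALS model, as in Task 1, followed by `recommendProducts`. The rank of 10 is an arbitrary placeholder, and the `trainer` parameter is a hypothetical hook that lets the function be exercised without Spark.

```python
def recommend_for_user(ratings, user, n=5, rank=10, seed=None, trainer=None):
    """Train a model on `ratings` and return the top n items for `user`."""
    if trainer is None:
        # Default to MLlib's implicit-feedback ALS (assumption, see above).
        from pyspark.mllib.recommendation import ALS
        trainer = ALS.trainImplicit
    model = trainer(ratings, rank, seed=seed)
    return model.recommendProducts(user, n)


# In the notebook (not run here), with `ratings` an RDD of Rating objects:
# recommendations = recommend_for_user(ratings, 1059637)
# print(recommendations)
```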

 

  6. Load the artist_data_small.txt and create a lookup for artists

['Aerosmith']

['Alkaline Trio']

['Bright Eyes']

['Jason Mraz']

['Jimmy Eat World']

['Hoobastank']

['Goo Goo Dolls']

['Modest Mouse']

['Something Corporate']

['Taking Back Sunday']

['The Movielife']

['Good Charlotte']

['Nena']

['Billy Joel']

['Dashboard Confessional']

['hellogoodbye']

['Clint Mansell']

['Brand New']

['Thursday']

['The Von Bondies']

['Motion City Soundtrack']

['Cursive']

['Onelinedrawing']

['A Static Lullaby']

['Coheed and Cambria']

['Further Seems Forever']

['Hey Mercedes']

['Hopesfall']

['Underoath']

['Hot Hot Heat']

['Frou Frou']

['Hanson']

['Mae']

['Against Me!']

['Remy Zero']

['Colin Hay']

['Senses Fail']

['Klaus Badelt']

['Boys Night Out']

['Dane Cook']

['Fall Out Boy']

['My Chemical Romance']

['Story of the Year']

['Blaque']

['Say Anything']

['Jonathan Larson']

['A Trunk Full Of Dead Bodies']

['Emocapella']

['Hidden in Plain View']

['Ryan Cabrera']

['Coldplay']

['Electric Light Orchestra']

['Elliott Smith']

['Head Automatica']

['U2']

['Ima Robot']

['Beastie Boys']

['The Early November']

['The Postal Service']

['The Rocket Summer']

['The Decemberists']

['The Shins']

['Straylight Run']

['The Format']

['Ashlee Simpson']

['Iron & Wine']

['Perfect Endings']

['Oasis']

['Nightmare Of You']

['The Killers']

['Cary Brothers']

["I Can Make a Mess Like Nobody's Business"]

['Zero 7']

['Hawthorne Heights']

['Thievery Corporation']

['Summit Drive']

['Matthew Walker']

['Nick Drake']

['Death Cab for Cutie']

['Lou Reed']

['Green Day']

['Simon & Garfunkel']

['The Smiths']

['Less Than Jake']

['Bonnie Somerville']

['Domestic Disturbance']

['The NSG']

 

  7. Link up the recommendations and the artist lookup

['Green Day']

['Cursive']

['Thrice']

['Taking Back Sunday']

['Something Corporate']
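The one-element lists in the output above are what the pair-RDD method `lookup` returns, so steps 6 and 7 can be sketched around it. This assumes artist_data_small.txt holds one `id<TAB>name` pair per line and that malformed lines can be skipped; the function names are hypothetical.

```python
def parse_artist(line):
    """'id<TAB>name' -> (int id, str name), or None for a malformed line."""
    artist_id, _, name = line.partition("\t")
    try:
        return int(artist_id), name.strip()
    except ValueError:
        return None


def build_artist_lookup(sc, path="artist_data_small.txt"):
    """Return a cached pair RDD of (artist_id, name)."""
    return (sc.textFile(path)
            .map(parse_artist)
            .filter(lambda pair: pair is not None)
            .cache())


def names_for(recommendations, artist_by_id):
    """Look up each recommended product ID; lookup() returns a list of names."""
    return [artist_by_id.lookup(rec.product) for rec in recommendations]


# In the notebook (not run here), with `recommendations` from step 5:
# artist_by_id = build_artist_lookup(sc)
# for names in names_for(recommendations, artist_by_id):
#     print(names)
```

Calling `lookup` once per recommendation starts one Spark job each time, which is fine for five results; for many IDs, collecting the pairs into a dict with `collectAsMap` is likely to be quicker.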

 
