Explained: COMP9321, dataframe, Python, PythonR|SQL

Assignment-1

The assignment data has been extracted from a movie dataset on Kaggle (https://www.kaggle.com/rounakbanik/the-movies-dataset), with some minor modifications to make things interesting. The dataset is split into two CSV files: credits (https://github.com/mysilver/COMP9321-Data-Services/raw/master/20t1/credits.csv) and movies (https://github.com/mysilver/COMP9321-Data-Services/raw/master/20t1/movies.csv). Use the datasets to answer the following questions:

Question 1: (based on both datasets) (0.5 Mark)
Join the two datasets on their id columns, keeping only the rows whose id has a match in both datasets (do not concatenate the datasets).

Question 2: (based on the dataframe created in Question-1) (0.5 Mark)
Keep the following columns in the resultant dataframe (remove all other columns from the result): id, title, popularity, cast, crew, budget, genres, original_language, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, vote_average, vote_count

Question 3: (based on the dataframe created in Question-2) (0.5 Mark)
Set the index of the resultant dataframe to id.

Question 4: (based on the dataframe created in Question-3) (0.5 Mark)
Drop all rows where the budget is 0.

Question 5: (based on the dataframe created in Question-4) (1 Mark)
Assume that there is a ranking scheme for movies defined by (revenue - budget)/budget. Add a new column to the dataframe, name it success_impact, and calculate it for each movie using the given formula.

Question 6: (based on the dataframe created in Question-5) (1 Mark)
Normalize the popularity column by scaling it to the range 0 to 100. The least popular movie should be 0 and the most popular one must be 100.
It is a float number.

Question 7: (based on the dataframe created in Question-6) (0.5 Mark)
Change the data type of the popularity column to int16.

Question 8: (based on the dataframe created in Question-7) (1.5 Marks)
Clean the cast column by converting the complex value (JSON) to a comma-separated value. The cleaned cast column should be a comma-separated list of characters, sorted alphabetically by name (e.g., Angela, Athena, Betty, Chester Rush). NOTE: keep unusual names, e.g., (uncredited), as they are; no further cleansing is needed.

Question 9: (based on the dataframe created in Question-8) (1.5 Marks)
Return a list containing the names of the top 10 movies according to the number of movie characters ("Harry Potter" is one character! Do not count the letters in the movie title!). The first element in the list should be the movie with the largest number of characters.

Question 10: (based on the dataframe created in Question-8) (1 Mark)
Sort the dataframe by release date (the most recently released movie should be the first row in the dataframe).

Question 11: (based on the dataframe created in Question-8) (2 Marks)
- (1.5 Marks) Plot a pie chart showing the distribution of genres in the dataset (e.g., Family, Drama).
- (0.5 Mark) Show the percentage of each genre in the pie chart.
Please note that the following figure is just a sample; it does not reflect the real values or the list of all genres in the dataset.

Question 12: (based on the dataframe created in Question-8) (2 Marks)
- (1.5 Marks) Plot a bar chart of the countries in which movies have been produced. For each country, show the count of movies.
- (0.5 Mark) Countries should be sorted alphabetically by name.
Please note that the following figure is just a sample; it does not reflect the real values or the list of all countries in the dataset.
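The cast cleaning in Question 8 and the character count in Question 9 can be sketched on a single made-up cell. This is only an illustration under stated assumptions: it assumes each cast cell is a string holding a list of dicts with a character key, and that the string can be parsed with ast.literal_eval (the template imports both ast and json; json.loads may be the better fit if the cells are strict JSON). The helper names clean_cast and character_count, and the sample values, are hypothetical.

```python
import ast


def clean_cast(raw):
    """Return the character names in one cast cell, sorted and comma-separated."""
    # Assumption: the cell is a stringified list of dicts with a "character" key.
    entries = ast.literal_eval(raw)
    names = sorted(entry["character"] for entry in entries)
    return ", ".join(names)


def character_count(raw):
    """Return how many cast entries (characters) one cell contains."""
    return len(ast.literal_eval(raw))


# Made-up cell for illustration; real cells come from the cast column.
sample = (
    "[{'cast_id': 1, 'character': 'Chester Rush', 'name': 'Actor One'},"
    " {'cast_id': 2, 'character': 'Angela', 'name': 'Actor Two'},"
    " {'cast_id': 3, 'character': '(uncredited)', 'name': 'Actor Three'}]"
)

print(clean_cast(sample))       # (uncredited), Angela, Chester Rush
print(character_count(sample))  # 3
```

On a dataframe, the same helper could be applied column-wise, for example df7["cast"].apply(clean_cast). Counting characters from the parsed list rather than by splitting the cleaned string on commas avoids miscounting names that themselves contain a comma.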
Question 13: (based on the dataframe created in Question-8) (2.5 Marks)
- (1.5 Marks) Plot a scatter chart with vote_average on the x axis and success_impact on the y axis.
- (0.5 Marks) Color the bubbles by movie language (e.g., English, French). If a movie has multiple languages, you are free to pick any one of them.
- (0.5 Marks) Add a legend showing the names of the languages and their associated colors.
Please note that the following figure is just a sample; it does not reflect the real values or the list of all countries in the dataset.

What not to forget!

Due Date: Friday the 13th of March 2020 17:59

Submit your script, named YOUR_ZID.py (e.g., z2123232.py), which contains your code.

You are required to use the following code template (it is not complete; please download the file) for your submission. You can download the code template from: https://raw.githubusercontent.com/mysilver/COMP9321-Data-Services/master/20t1/z1111111.py

If you do not follow this structure, you will not be marked. You can only add code in the specified lines (do not edit the rest of the lines):

    import ast
    import json
    import matplotlib.pyplot as plt
    import pandas as pd
    import sys
    import os

    studentid = os.path.basename(sys.modules[__name__].__file__)


    #################################################
    # Your personal methods can be here ...
    #################################################


    def log(question, output_df, other):
        print("--------------- {}----------------".format(question))
        if other is not None:
            print(question, other)
        if output_df is not None:
            print(output_df.head(5).to_string())


    def question_1(movies, credits):
        """
        :param movies: the path for the movie.csv file
        :param credits: the path for the credits.csv file
        :return: df1
                Data Type: Dataframe
                Please read the assignment specs to know how to create the output dataframe
        """

        #################################################
        # Your code goes here ...
        #################################################

        log("QUESTION 1", output_df=df1, other=df1.shape)
        return df1

    ...

    if __name__ == "__main__":
        df1 = question_1("movies.csv", "credits.csv")
        df2 = question_2(df1)
        df3 = question_3(df2)
        df4 = question_4(df3)
        df5 = question_5(df4)
        df6 = question_6(df5)
        df7 = question_7(df6)
        df8 = question_8(df7)
        movies = question_9(df8)
        df10 = question_10(df8)
        question_11(df10)
        question_12(df10)
        question_13(df10)

If your code does not run on CSE machines for any reason (e.g., a hard-coded file path such as C://Users/), you will be penalized by at least 5 marks. We assume that the two CSV files are located in the same directory as your script, and that their names are the same as in the template (movies.csv and credits.csv).

Please look at the documentation for each question method; it describes the inputs (e.g., a dataframe) and output (e.g., a dataframe, a list of movies) of the method.

Please use the same variable names as mentioned in the comments (e.g., in question 8, you are supposed to create a dataframe and name it df8):

    :param df7: the dataframe created in question 7
    :return: df8
            Data Type: Dataframe
            Please read the assignment specs to know how to create the output dataframe

    #################################################
    # Your code goes here ...
    #################################################

In the last three questions, you need to plot charts; please do not use the plt.show() function to pop up charts. The code template will automatically save the chart to disk. All you need to do is call the plot functions of the dataframe (e.g., df.plot.pie()). We highly recommend that you go through the lab activities to learn how to plot charts.

FAQ:

Can I pass extra variables to functions?
No

Can we create our own functions besides the question functions (e.g., question_1)?
Yes

Can I call another function inside the question functions? e.g., calling question_1 inside question_2
Yes

What should I do if my charts are not shown automatically?
Look at the lab sample code; if you still need help, ask your tutor during the labs.

How should I print my dataframe?
print(df.to_string())

Is it okay that the graph for Q8 does not pop up until the graph for Q7 is closed, or should they both pop up at the same time?
This is fine

Do the charts need to look the same (colors, legend position, grid) as the examples shown, or would it be fine to just use the default plotting from pandas?
The default colours/fonts are fine

How are our submissions marked?
They are marked manually by tutors, by running the following command: python3 z{YOUR_ZID}.py

What python packages can I use in my assignment?
You can only use pandas and matplotlib to do the assignment.

What version of python should I use?
Python 3+

How can I submit my assignment?
Go to the assignment page and click on the Make Submission tab; pick your files, which must be named YOUR_ZID.py. Make sure that the files are not empty, and submit the files together.

Can I submit my file after the deadline?
Yes, you can. But 25% of your assignment will be deducted as a late penalty per day. In other words, if you are more than 3 days late, you will not be marked.

Resource created 27 days ago, last modified about 2 hours ago.

Plagiarism

This is an individual assignment. The work you submit must be your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted.
The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined manually.

Do not provide or show your assignment work to any other person, apart from the teaching staff of this course. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent. Note that it is also your duty to protect your code artifacts. If you are using any online solution to store your code artifacts (e.g., GitHub), make sure to keep the repository private and do not share access with anyone.

Reminder: Plagiarism is defined as (https://student.unsw.edu.au/plagiarism) using the words or ideas of others and presenting them as your own. UNSW and CSE treat plagiarism as academic misconduct, which means that it carries penalties as severe as being excluded from further study at UNSW. There are several online sources to help you understand what plagiarism is and how it is dealt with at UNSW:
Plagiarism and Academic Integrity (https://student.unsw.edu.au/plagiarism)
UNSW Plagiarism Procedure (https://www.gs.unsw.edu.au/policy/documents/plagiarismprocedure.pdf)

Make sure that you read and understand these. Ignorance is not accepted as an excuse for plagiarism. In particular, you are also responsible for ensuring that your assignment files are not accessible by anyone but you by setting the correct permissions in your CSE directory and code repository, if using one (e.g., GitHub and similar). Note also that plagiarism includes paying or asking another person to do a piece of work for you and then submitting it as your own work.

UNSW has an ongoing commitment to fostering a culture of learning informed by academic integrity. All UNSW staff and students have a responsibility to adhere to this principle of academic integrity.
Plagiarism undermines academic integrity and is not tolerated at UNSW.

Comments

William Ling (Thu Mar 05 2020 03:14:01 GMT+1100 (AEDT))
For Question 8, should we keep or remove the names that are (uncredited)?

    Mohammadali Yaghoubzadehfard (Thu Mar 05 2020 03:32:42 GMT+1100 (AEDT))
    Can you give an example?

    William Ling (Thu Mar 05 2020 03:54:58 GMT+1100 (AEDT))
    {cast_id: 154, character: (uncredited), credit_id: 586da64c9251412956000c68, gender: 0, id: 106585}
    Above is an instance where the character is simply (uncredited)

    Mohammadali Yaghoubzadehfard (Thu Mar 05 2020 03:59:34 GMT+1100 (AEDT))
    Keep it as it is.

Shreya Anilkumar (Thu Mar 05 2020 02:47:03 GMT+1100 (AEDT))
Hi
In question 8: Clean the cast column by converting the complex value (JSONs) to a comma-separated value.
Do we have to create different columns by splitting the cast column and sort based on the name?

    Mohammadali Yaghoubzadehfard (Thu Mar 05 2020 03:33:29 GMT+1100 (AEDT))
    No, you just need to create a string with all the names of the characters concatenated with commas.

Joseph Aczel (Wed Mar 04 2020 20:33:12 GMT+1100 (AEDT))
Hi,
In all previous database assignments that I've done we have had an expected results file so that we know that our data is displaying the same as what the marker expects to see.
Will we have a similar file or script to check our questions with?

    Mohammadali Yaghoubzadehfard (Wed Mar 04 2020 21:43:25 GMT+1100 (AEDT))
    Hi
    There are no sample answers. Your assignment will be marked manually by tutors.

    Joseph Aczel (Thu Mar 05 2020 03:20:17 GMT+1100 (AEDT))
    Does that mean that we will be penalised for mistakes in previous questions? E.g., because question 5 relies on question 4, errors in question 4 will be present in question 5.

    Mohammadali Yaghoubzadehfard (Thu Mar 05 2020 03:34:44 GMT+1100 (AEDT))
    No, you should not be penalized for such cases. Tutors will investigate both results and code.

Reposted from: http://www.3zuoye.com/contents/3/4832.html
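The dependency raised in the last thread (Question 5 relying on Question 4) can be seen on toy data: the success_impact formula divides by budget, so a zero-budget row that Question 4 should have removed produces inf. The sketch below is only an illustration; the three rows and the variable names are made up, and it chains Questions 4 to 7 in the order the spec describes.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from movies.csv.
df = pd.DataFrame(
    {
        "budget": [100, 0, 50],
        "revenue": [300, 10, 25],
        "popularity": [2.0, 5.0, 10.0],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Skipping Question 4: the zero-budget row divides by zero and yields inf.
unfiltered_impact = (df["revenue"] - df["budget"]) / df["budget"]

# Question 4: drop rows whose budget is 0.
df4 = df[df["budget"] != 0].copy()

# Question 5: success_impact = (revenue - budget) / budget.
df4["success_impact"] = (df4["revenue"] - df4["budget"]) / df4["budget"]

# Question 6: min-max scale popularity to the range 0..100.
pop = df4["popularity"]
df4["popularity"] = (pop - pop.min()) / (pop.max() - pop.min()) * 100

# Question 7: cast to int16 (astype truncates any fractional part).
df4["popularity"] = df4["popularity"].astype("int16")

print(unfiltered_impact.loc[2])  # inf
print(df4)
```

Because the minimum and maximum in Question 6 are taken after the zero-budget rows are gone, an error in Question 4 would also shift the normalized popularity values, not just success_impact.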
