This task simulates a data analyst’s report. The work needs to be accurate and presented in a professional fashion. You will not be presenting the report in person, so your report must speak for itself. You will be given a data file that you will need to analyze.
Do these tasks in one Jupyter notebook in the order provided. When completed, save the notebook as an HTML file showing all output and submit it in Canvas along with your original IPYNB file. Your work should be organized, annotated with markdown cells, and easy to follow. Answers to the individual questions should be easy to find in the report. Avoid excess output by using head/tail or other options where necessary.
Task 1
- Read in the file Popular_Baby_Names.csv
- General Data Layout/Definitions
  - Year of Birth: Year for this observation
  - Gender: Male/Female
  - Ethnicity: Various
  - Count: Number of occurrences of this name for year/gender/ethnicity
  - Rank: Rank of this name for year/gender/ethnicity
- Print the shape and the first 10 rows to familiarize yourself with the data.
- Rename the “Year of Birth” & “Child’s First Name” columns to remove the spaces and apostrophe. Explain why managing column names this way can be advantageous.
- Print the data types and describe if they are appropriate for analysis or need to be converted to other types. If they need to be converted, perform the conversions.
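
The Task 1 steps could be approached roughly as follows. This is only a minimal sketch, not a model answer: the three sample rows are made up so that the block runs without the data file, the new column names (`YearOfBirth`, `ChildsFirstName`) are one possible choice, and it assumes the header uses a plain straight apostrophe. Whether the text columns should become categories is a judgment the report still has to justify.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for Popular_Baby_Names.csv.
# With the real file, use: df = pd.read_csv("Popular_Baby_Names.csv")
sample_csv = io.StringIO(
    "Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank\n"
    "2011,FEMALE,HISPANIC,GERALDINE,13,75\n"
    "2011,FEMALE,HISPANIC,GIA,21,67\n"
    "2012,MALE,WHITE NON HISPANIC,JOSEPH,300,1\n"
)
df = pd.read_csv(sample_csv)

# Shape and first 10 rows (head limits the output on the full file).
print(df.shape)
print(df.head(10))

# Rename the two columns so they contain no spaces or apostrophes.
# Assumption: the header uses a straight apostrophe, as in the sample above.
df = df.rename(columns={
    "Year of Birth": "YearOfBirth",
    "Child's First Name": "ChildsFirstName",
})

# Inspect the data types, then convert where a different type seems more suitable.
print(df.dtypes)
df["Gender"] = df["Gender"].astype("category")
df["Ethnicity"] = df["Ethnicity"].astype("category")
print(df.dtypes)
```

Column names without spaces or punctuation can be used with attribute access (`df.Count`) and inside `DataFrame.query` strings without special quoting, which is one argument the explanation might make.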
Task 2
- After reviewing the data, clean the data as you see appropriate. For this cleaning step, you do not need to show all your work or interim results as you examine the data, but you do need to show any commands you used to alter any of the data in the order you use them. After cleaning, perform the following steps.
- Print and explain basic descriptive statistics. Both grid/numeric styles and graphic/chart styles can be helpful.
- Determine how many unique names are in the data. This should be the number of unique names within the entire dataset – i.e. not specific to year, gender, or ethnicity.
- Sort the data by Rank, then Year of Birth, then Gender, then Ethnicity. Print the first 60 rows.
- Print the ten most popular (greatest count) names in descending order by total count. This should be the most popular names within the entire dataset – i.e. not specific to year, gender, or ethnicity.
- Print the ten most popular (greatest count) names for each gender, not specific to year or ethnicity in descending order by total count.
- Determine how many names were recorded 10 or fewer times. This should be the least popular names within the dataset – i.e. not specific to year, gender or ethnicity.
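
The Task 2 steps above might be sketched like this. The data frame is a small made-up example, and the cleaning shown (upper-casing names and dropping exact duplicate rows) is only an assumption about what the real data may need; inspect the file before deciding. The sketch also reads "recorded 10 or fewer times" as a total Count of 10 or less across the whole dataset, which is one possible interpretation. Charts are omitted here.

```python
import pandas as pd

# Hypothetical sample data with already-renamed columns; the real file is much larger.
df = pd.DataFrame({
    "YearOfBirth": [2011, 2011, 2012, 2012, 2012, 2011],
    "Gender": ["FEMALE", "FEMALE", "MALE", "MALE", "FEMALE", "MALE"],
    "Ethnicity": ["HISPANIC", "HISPANIC", "WHITE NON HISPANIC",
                  "HISPANIC", "HISPANIC", "HISPANIC"],
    "ChildsFirstName": ["Gia", "GIA", "Joseph", "JOSEPH", "Ava", "Liam"],
    "Count": [21, 21, 300, 150, 8, 10],
    "Rank": [67, 67, 1, 2, 80, 78],
})

# Assumed cleaning: make name casing consistent, then drop exact duplicate rows.
df["ChildsFirstName"] = df["ChildsFirstName"].str.upper()
df = df.drop_duplicates().reset_index(drop=True)

# Basic descriptive statistics for the numeric columns.
print(df.describe())

# Number of unique names across the entire dataset.
n_unique = df["ChildsFirstName"].nunique()
print(n_unique)

# Sort by Rank, then year, then gender, then ethnicity; show the first 60 rows.
sorted_df = df.sort_values(["Rank", "YearOfBirth", "Gender", "Ethnicity"])
print(sorted_df.head(60))

# Ten most popular names overall, by total Count, in descending order.
totals = df.groupby("ChildsFirstName")["Count"].sum()
top10 = totals.sort_values(ascending=False).head(10)
print(top10)

# Ten most popular names for each gender, by total Count.
gender_totals = df.groupby(["Gender", "ChildsFirstName"])["Count"].sum()
top_by_gender = (
    gender_totals.sort_values(ascending=False)
    .groupby(level="Gender")
    .head(10)
)
print(top_by_gender)

# Names whose total Count over the whole dataset is 10 or fewer.
rare_count = int((totals <= 10).sum())
print(rare_count)
```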