5.21 ① distributions and sampling

US income data

In this mission, we'll be looking at US income data. Each row is a single county in the US. For each county, we have the following columns:

id -- the county id.
county -- the name and state of the county.
pop_over_25 -- the number of adults over age 25.
median_income -- the median income for residents over age 25 in the county.
median_income_no_hs -- median income for residents without a high school education.
median_income_hs -- median income for high school graduates who didn't go to college.
median_income_some_college -- median income for residents who went to college but didn't graduate.
median_income_college -- median income for college graduates.
median_income_graduate_degree -- median income for those with a master's or other graduate degree.

Find the county with the lowest median income in the US (median_income). Assign the name of the county (county) to lowest_income_county.

Among the counties with more than 500000 residents over age 25 (pop_over_25), find the one with the lowest median income. Assign the name of the county to lowest_income_high_pop_county.

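# `income` is assumed to be a pandas DataFrame that has
# already been loaded, with one row per county.
# Print the first 5 rows of the data.
print(income.head())

# The .idxmin() method returns the index label of the
# minimum value in a column.
lowest_income_county = income.loc[income['median_income'].idxmin(), 'county']

# Keep only the counties with more than 500000 residents over age 25.
high_pop = income[income['pop_over_25'] > 500000]

# Filtering keeps the original index labels, so the label
# returned by .idxmin() still identifies the right row.
lowest_income_high_pop_county = high_pop.loc[high_pop['median_income'].idxmin(), 'county']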
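
To make the filtered lookup easier to follow, here is a minimal sketch on a made-up four-county table. The county names and numbers are invented for illustration and are not taken from the income data. It shows that filtering keeps the original row labels, so the label returned by idxmin() still points at the intended row.

```python
import pandas as pd

# Hypothetical toy data; all values are invented for illustration only.
toy = pd.DataFrame({
    "county": ["A County", "B County", "C County", "D County"],
    "pop_over_25": [600000, 1000, 800000, 700000],
    "median_income": [30000, 10000, 25000, 40000],
})

# Lowest median income across all rows.
lowest = toy.loc[toy["median_income"].idxmin(), "county"]

# Filtering keeps the original row labels (0, 2 and 3 here).
big = toy[toy["pop_over_25"] > 500000]
lowest_big = big.loc[big["median_income"].idxmin(), "county"]

print(lowest)      # B County
print(lowest_big)  # C County
```

Using .loc with the label from idxmin() avoids mixing up labels and positions once rows have been filtered out.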

random numbers

import random

# Return a single random integer between
# 0 and 10, inclusive.
num = random.randint(0, 10)

# Generate a sequence of 10 random integers
# between 0 and 10.
random_sequence = [random.randint(0, 10) for _ in range(10)]

# After a random seed is set, the numbers generated
# afterwards follow the same sequence every time.

random.seed(10)
print([random.randint(0,10) for _ in range(5)])

# same sequence as above
random.seed(10)
print([random.randint(0,10) for _ in range(5)])

# different seed means different sequence
random.seed(11)
print([random.randint(0, 10) for _ in range(5)])

# Set a random seed of 20 and generate a list of
# 10 random numbers between the values 0 and 10.
# Assign the list to new_sequence.

random.seed(20)
new_sequence = [random.randint(0, 10) for _ in range(10)]
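
The seeding behaviour can also be checked in code rather than by comparing printed lists. The sketch below is an illustration, not part of the original exercise: the helper name draw is hypothetical, and it uses a separate random.Random instance so that the module-level generator is left untouched.

```python
import random

def draw(seed, count):
    """Return `count` integers in [0, 10] from a generator seeded with `seed`.

    `draw` is a hypothetical helper introduced only for this example.
    """
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 10) for _ in range(count)]

first = draw(10, 5)
second = draw(10, 5)
other = draw(11, 5)

print(first == second)  # True: same seed, same sequence
print(first)
print(other)
```

A dedicated generator object is one way to keep results reproducible when other code also calls the random module.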

selecting items from a list

# Let's say that we have some data on
# how much shoppers spend in a store.
shopping = [300, 200, 100, 600, 20]

# We want to sample the data
# and select only 4 elements.

random.seed(1)
# Pick 4 random items from the shopping list,
# without replacement.
shopping_sample = random.sample(shopping, 4)
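
random.sample draws without replacement, which is easy to overlook. The following sketch contrasts it with random.choices, which draws with replacement. The comparison is an addition for illustration and reuses the shopping list from above.

```python
import random

shopping = [300, 200, 100, 600, 20]

random.seed(1)
# random.sample draws without replacement: each list position
# is used at most once, so k cannot exceed len(shopping).
without_replacement = random.sample(shopping, 4)

# random.choices draws with replacement: the same item can
# appear several times, and k may exceed the list length.
with_replacement = random.choices(shopping, k=8)

print(without_replacement)
print(with_replacement)

# Asking random.sample for more items than exist raises ValueError.
try:
    random.sample(shopping, 6)
except ValueError as err:
    print("ValueError:", err)
```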

population vs sample

# Use [roll() for _ in range(x)] to generate the rolls,
# with x being the number of rolls.
# Use plt.hist(sample, 6) to generate the plot.
# Make sure to set the seed before
# generating each sequence of rolls.

import matplotlib.pyplot as plt

# Make a function that returns the
# result of a single die roll.
def roll():
    return random.randint(1, 6)


random.seed(1)
small_sample = [roll() for _ in range(10)]

# Plot a histogram with 6 bins,
# 1 for each possible outcome
# of the die roll.
plt.hist(small_sample, 6)
plt.title('small')
plt.show()

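# With the same seed, a longer sequence starts with the
# shorter one: the first m rolls of range(m + 1) match
# those of range(m), followed by one new roll at the end.
random.seed(1)
medium_sample = [roll() for _ in range(100)]

plt.hist(medium_sample, 6)
plt.title('medium')
plt.show()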
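
The note about sequences sharing a prefix can be verified directly. This is a small sketch that redefines roll() so it runs on its own. It also prints the share of each face instead of drawing a histogram, which makes the numbers easy to compare with 1/6.

```python
import random
from collections import Counter

def roll():
    """Return the result of one die roll (1 to 6)."""
    return random.randint(1, 6)

random.seed(1)
ten_rolls = [roll() for _ in range(10)]

random.seed(1)
hundred_rolls = [roll() for _ in range(100)]

# Same seed: the longer sequence begins with the shorter one.
print(hundred_rolls[:10] == ten_rolls)  # True

# Share of each face in both samples; 1/6 is about 0.167.
for name, rolls in [("10 rolls", ten_rolls), ("100 rolls", hundred_rolls)]:
    counts = Counter(rolls)
    shares = {face: round(counts[face] / len(rolls), 2) for face in range(1, 7)}
    print(name, shares)
```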

finding the right sample size

As you can see from the graphs above, the probability of rolling a 1 should be around 0.167 (that is, 1/6). However, the observed probability only settles near this value once we get to 10000 dice rolls. Generally, the smaller your sample size, the more the observed probability varies around the "true" probability.

We can graph this variability by repeatedly rolling the die N times. For example, we could do 20 trials of rolling the die 10 times, and graph all the resulting probabilities of rolling a 1. This would tell us how much error we could expect when rolling the die 10 times.

# Use probability_of_one(x, y) to generate
# the rolls, with x being the number of trials
# and y being the number of rolls per trial.

def probability_of_one(num_trials, num_rolls):
    """
    Take the number of trials and the number of rolls per trial,
    conduct each trial, and record the observed probability of
    rolling a one.
    """
    probabilities = []

    for _ in range(num_trials):
        die_rolls = [roll() for _ in range(num_rolls)]
        # Fraction of rolls in this trial that came up 1.
        one_probability = die_rolls.count(1) / num_rolls
        probabilities.append(one_probability)
    return probabilities

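# 300 trials of 50 rolls each.
random.seed(1)
small_sample = probability_of_one(300, 50)

# The same axis limits are used for all three plots
# so that the histograms can be compared directly.
plt.hist(small_sample, 20)
plt.ylim(0, 70)
plt.xlim(0, 0.4)
plt.show()

# 300 trials of 100 rolls each.
random.seed(1)
medium_sample = probability_of_one(300, 100)
plt.hist(medium_sample, 20)
plt.ylim(0, 70)
plt.xlim(0, 0.4)
plt.show()

# 300 trials of 1000 rolls each.
random.seed(1)
large_sample = probability_of_one(300, 1000)
plt.hist(large_sample, 20)
plt.ylim(0, 70)
plt.xlim(0, 0.4)
plt.show()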
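
The narrowing of the histograms can also be put into numbers. The sketch below is an addition to the original notes. It compares the observed spread of the estimates with the standard error of a proportion, sqrt(p * (1 - p) / n), a standard result that the notes themselves do not mention. The functions are redefined so the block runs on its own.

```python
import math
import random
import statistics

def roll():
    """Return the result of one die roll (1 to 6)."""
    return random.randint(1, 6)

def probability_of_one(num_trials, num_rolls):
    """Return the observed share of ones for each trial."""
    probabilities = []
    for _ in range(num_trials):
        die_rolls = [roll() for _ in range(num_rolls)]
        probabilities.append(die_rolls.count(1) / num_rolls)
    return probabilities

p = 1 / 6  # true probability of rolling a 1 with a fair die
results = {}
for num_rolls in (50, 100, 1000):
    random.seed(1)
    estimates = probability_of_one(300, num_rolls)
    observed = statistics.stdev(estimates)         # spread seen across the 300 trials
    expected = math.sqrt(p * (1 - p) / num_rolls)  # standard error of a proportion
    results[num_rolls] = (observed, expected)
    print(f"{num_rolls:>5} rolls: observed {observed:.4f}, expected {expected:.4f}")
```

The expected spread shrinks with the square root of the number of rolls, so going from 100 to 1000 rolls narrows the histogram by a factor of roughly 3.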
