How does accuracy of a survey depend on sample size and population size?


























Which survey is more accurate? Assume the samples are taken perfectly randomly.




  1. A sample of 100 people out of a population of 1000 (sample is 10% of population)


  2. A sample of 1000 people out of a population of 1000000 (sample is 0.1% of population)





I remember my lecturer saying something like "when the sample size is small compared to the population, the accuracy depends almost entirely on the sample size; the population size is unimportant". Is there a name for that result? It's quite surprising at first.



I'd love to see some graphs of these functions.



If it helps, here's a concrete example (made up by me).




An unknown proportion $p$ of the population favour candidate Alice. The rest favour Bob. We take a random sample of size $k$ from the population (of size $n$) and ask their preferences, to come up with an estimate $\hat{p}$.



How does the expected error $\mathbb{E}|\hat{p} - p|$ depend on $k$ and $n$? And in the limit $n\to\infty$?
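For concreteness, here is a minimal Monte Carlo sketch of this setup (assuming Python with numpy; the true proportion $p=0.6$, the seed, and the function name are my own arbitrary choices) that estimates $\mathbb{E}|\hat{p}-p|$ by repeatedly drawing a sample of size $k$ without replacement from a population of size $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_abs_error(n, k, p=0.6, reps=20_000):
    """Monte Carlo estimate of E|p_hat - p| when a sample of size k is drawn
    without replacement from a population of size n with true proportion p."""
    alice = round(p * n)                         # number of Alice supporters in the population
    # number of Alice supporters in each sample ~ Hypergeometric(n, alice, k)
    sample_counts = rng.hypergeometric(alice, n - alice, k, size=reps)
    p_hat = sample_counts / k
    return np.mean(np.abs(p_hat - alice / n))

print(expected_abs_error(n=1_000, k=100))        # survey 1
print(expected_abs_error(n=1_000_000, k=1_000))  # survey 2
```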











statistics sampling

edited Jan 10 at 14:45 | asked Sep 10 '14 at 16:17 – Colonel Panic








  • Look up 'finite population correction factor'. – copper.hat, Sep 11 '14 at 23:38










  • Population size doesn't determine the sample size. What matters more is how much variability is present in the population. For example, if you go to buy rice, you take the same sample size whether you are buying a whole sack or only 3-4 kg, because you believe the variability in the sack is the same as the variability in the 3-4 kg. – user 31466, Jul 11 '15 at 15:21
















2 Answers

























Since you asked for graphs, I will make your example even a little more
specific. Suppose that in the sample we take, $60\%$ of the voters prefer Alice.
For a sample size of $100,$ that means we had $60$ responses for Alice.



So it looks like Alice is ahead of Bob, doesn't it?
But how sure are we?
To assess the accuracy of our poll,
we can try setting up a one-sided $95\%$ confidence interval
by finding the smallest percentage of the population
that might prefer Alice such that there is at least a $5\%$ chance that
we would have gotten at least $60$ responses favoring Alice out
of $100$ total responses.



If the population size were $100,$ we would have polled the entire
population, and the confidence interval would start at $60\%$
(zero error).
If the size of the population were just $105$,
of whom only $61$ ($58.1\%$) favored Alice, there would still be
better than a $5\%$ chance that we could get $60$ or more responses for Alice;
but if fewer people favored Alice then there would be
less than a $5\%$ chance of $60$ favorable responses.
So we set the $95\%$ confidence interval at $58.1\%$ or greater,
which is a maximum error of $1.9\%$ in the direction of overcounting
(that is, our estimate could be that much higher than the true percentage).
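Here is a sketch of that computation (Python with scipy is an assumption on my part, and the helper name is mine). For a given population size it finds, by binary search, the smallest number of Alice supporters for which getting at least $60$ favorable responses out of $100$ still has probability at least $5\%$, and reports the resulting one-sided error:

```python
from scipy.stats import hypergeom

def one_sided_error(pop_size, sample=100, observed=60, alpha=0.05):
    """One-sided error: observed fraction minus the smallest population fraction
    that is not ruled out at level alpha (simple random sampling)."""
    # P(X >= observed) grows with the number of supporters a, so binary-search
    # for the smallest a with hypergeom.sf(observed - 1, pop_size, a, sample) >= alpha.
    lo, hi = observed, pop_size
    while lo < hi:
        mid = (lo + hi) // 2
        if hypergeom.sf(observed - 1, pop_size, mid, sample) >= alpha:
            hi = mid
        else:
            lo = mid + 1
    return observed / sample - lo / pop_size

for N in [100, 105, 1_000, 3_000, 100_000]:
    print(N, round(one_sided_error(N), 3))   # should show the leveling-off described below
```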



As the population size increases, the maximum error (in one direction)
increases rapidly at first but soon starts to level off.
The chart below shows errors for population sizes up to $3000$.



[Chart: one-sided error at the $95\%$ confidence level versus population size (up to $3000$), for a sample of $100$ with $60$ favorable responses]



For a population of $1000$, our sample of $100$ (with $60$ favorable responses)
has a (one-sided) error of $8.2\%$ at the $95\%$ confidence level.
But the error grows very slowly after that.
In fact it will never be larger than about $8.7\%$,
even for populations of a million or more.
There is some effect of increasing the population size above
$1000$ in this example, but it is not very dramatic.



On the other hand, suppose we have a population of one million voters,
take a sample of $k$ voters from this population, and find that
$60\%$ of them prefer Alice.
If $k=100$ then we have a one-sided error of
about $8.7\%$ at the $95\%$ confidence level,
but if we make $k$ larger, the error decreases as shown in the chart below.



[Chart: one-sided error at the $95\%$ confidence level versus sample size $k$, for a population of one million with $60\%$ favorable responses]



For $k=1000$ the error is only about $2.6\%$.
That is, while for a sample of $100$ it made little difference
whether the population was $1000$ or one million,
it makes quite a lot of difference whether the sample size is
$100$ or $1000$.
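Since a population of one million is effectively infinite relative to these samples, a plain binomial model is a reasonable stand-in for the exact hypergeometric computation; a similar sketch (again assuming Python with scipy, helper name mine) gives the one-sided error as a function of the sample size $k$:

```python
from math import ceil
from scipy.stats import binom

def one_sided_error_large_pop(k, observed_frac=0.60, alpha=0.05, tol=1e-6):
    """One-sided error for a sample of size k from a very large population."""
    m = ceil(observed_frac * k)               # observed number of favorable responses
    lo, hi = 0.0, observed_frac               # hi always satisfies P(X >= m) >= alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binom.sf(m - 1, k, mid) >= alpha:  # P(X >= m) when the true proportion is mid
            hi = mid
        else:
            lo = mid
    return observed_frac - hi

for k in [100, 300, 1_000, 3_000]:
    print(k, round(one_sided_error_large_pop(k), 3))  # roughly 0.087 at k=100, 0.026 at k=1000
```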



We could do a similar analysis with two-sided confidence intervals,
where we look at possible errors in either direction.
The results would tell much the same story:
little degradation in accuracy for the $100$-voter sample as the
population grows above $1000,$ but great improvement in accuracy
when we increase the sample size.



That's what your lecturer meant by saying the accuracy depends
almost entirely on sample size
when the sample size is small relative to the population.
The only time the population size made much difference in our example
(changing the bounds of the confidence interval by
more than one percentage point)
was when the sample was considerably more than $10\%$ of the entire population.






answered Jul 12 '15 at 3:16, edited Jul 16 '15 at 1:22 – David K






















Note: For convenience, in the following I use $N$ for the size of the population and $n$ for the sample size.

In order to answer the OP's questions we start with some preliminary work and describe the current situation in somewhat more detail.

Current situation:

Here we have simple random sampling, meaning that every possible combination of $n$ units from a population of size $N$ is equally likely to be the sample selected.

We are in a sampling situation where the object is to estimate the proportion of units in a population having some attribute. In such a situation, the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.



Writing $p$ for the proportion in the population of size $N$ with the attribute,
\begin{align*}
p=\frac{1}{N}\sum_{i=1}^{N}y_i=\mu,
\end{align*}
the finite population variance is
\begin{align*}
\sigma^2&=\frac{\sum_{i=1}^{N}(y_i-p)^2}{N-1}=\frac{\sum_{i=1}^{N}y_i^2-Np^2}{N-1}
=\frac{Np-Np^2}{N-1}\\
&=\frac{N}{N-1}p(1-p)
\end{align*}
Now letting $\hat{p}$ denote the proportion in the sample of size $n$ with the attribute,
\begin{align*}
\hat{p}=\frac{1}{n}\sum_{i=1}^{n}y_i=\bar{y},
\end{align*}
the sample variance is
\begin{align*}
s^2&=\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}=\frac{\sum_{i=1}^{n}y_i^2-n\hat{p}^2}{n-1}\\
&=\frac{n}{n-1}\hat{p}(1-\hat{p})
\end{align*}

Note that the sample proportion is the sample mean of a simple random sample; it is unbiased for the population proportion and has variance
\begin{align*}
\operatorname{var}(\hat{p})=\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}\tag{1}
\end{align*}
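A quick empirical check of (1) (a sketch assuming Python with numpy; the particular values $N=1000$, $n=100$, $p=0.3$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical check of formula (1); the values N = 1000, n = 100, p = 0.3 are arbitrary.
N, n, p = 1_000, 100, 0.3
population = np.zeros(N)
population[: round(p * N)] = 1                    # N*p units carry the attribute

p_hats = np.array([rng.choice(population, size=n, replace=False).mean()
                   for _ in range(20_000)])

print(p_hats.var())                               # simulated var(p_hat)
print((N - n) / (N - 1) * p * (1 - p) / n)        # formula (1): about 0.00189
```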




Before we can answer the OP's questions we have to make some general

Considerations regarding accuracy:



Suppose that one wishes to estimate a population parameter $\theta$ (for example the population mean, total, or proportion of units having some attribute) with an estimator $\hat{\theta}$. Then we would wish the estimate to be close to the true value with high probability.

So, specifying a maximum allowable difference $d$ between the estimate and the true value, and allowing for a small probability $\alpha$ that the error may exceed that difference, the challenge is to choose a sample size $n$ such that
\begin{align*}
P(|\hat{\theta}-\theta|>d)<\alpha\tag{2}
\end{align*}
If the estimator $\hat{\theta}$ is an unbiased, normally distributed estimator of $\theta$, then $\frac{\hat{\theta}-\theta}{\sqrt{\operatorname{var}(\hat{\theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $\frac{\alpha}{2}$ point of the standard normal distribution yields
\begin{align*}
P\left(\frac{|\hat{\theta}-\theta|}{\sqrt{\operatorname{var}(\hat{\theta})}}>z\right)
=P\left(|\hat{\theta}-\theta|>z\sqrt{\operatorname{var}(\hat{\theta})}\right)=\alpha
\end{align*}




Now, since $d$ and the expression (2) provide us with a precise idea of accuracy, we are ready to harvest.

Observe that the variance of the estimator $\hat{\theta}$ decreases with an increasing sample size $n$, so that the inequality above will be satisfied if we can choose $n$ large enough to make
\begin{align*}
z\sqrt{\operatorname{var}(\hat{\theta})}\leq d\tag{3}
\end{align*}

These are the relevant parameters to deal with accuracy. Next we consider




Sample size $n$ for estimating a proportion:

To obtain an estimator $\hat{p}$ having probability at least $1-\alpha$ of being no farther than $d$ from the population proportion, the sample size formula based on the normal approximation gives, according to (1) and (3),
\begin{align*}
\operatorname{var}(\hat{\theta})&=\frac{d^2}{z^2}\\
\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}&=\frac{d^2}{z^2}
\end{align*}
Setting $n_0=\frac{z^2}{d^2}p(1-p)$, we obtain
\begin{align*}
n=\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}\tag{4}
\end{align*}

Note that the formula depends on the unknown population proportion $p$. Since no estimate of $p$ is available, a worst-case value of $p=\frac{1}{2}$ can be used in determining the sample size. This approach is justified since the quantity $p(1-p)$, and hence the value of $n$, assumes its maximum value when $p=\frac{1}{2}$.




Note: When $N$ is large compared with the sample size $n$, formula (4) reduces to

\begin{align*}
n&\simeq \lim_{N\rightarrow \infty}\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}=n_0
\end{align*}
Since then $n=n_0$, we obtain
\begin{align*}
n=\frac{z^2}{d^2}p(1-p)\tag{5}
\end{align*}
and we see, in accordance with the OP's lecturer, that when the sample size $n$ is small compared with the population size, the accuracy $d$ depends on the sample size only.

With regard to one of the OP's questions, I'm not aware of a specific name for this result. But the factor $\frac{N-n}{N-1}$ in (1) is sometimes called the finite population correction.
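To illustrate this numerically, here is a small sketch (assuming Python; the helper name is mine) evaluating formula (4) for a fixed target accuracy $d=0.03$ and growing $N$; the required sample size quickly levels off at $n_0$:

```python
from math import ceil

def required_sample_size(N, d=0.03, z=1.96, p=0.5):
    """Formula (4): sample size needed so that the estimated proportion is within d
    of the truth with confidence corresponding to z (worst case p = 1/2)."""
    n0 = z**2 / d**2 * p * (1 - p)            # formula (5): infinite-population sample size
    return ceil(1 / ((N - 1) / (N * n0) + 1 / N))

for N in [2_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    print(N, required_sample_size(N))
# The values approach n0 = 1.96**2 / 0.03**2 * 0.25, about 1067, as N grows.
```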




Which scenario is more accurate:

To answer this question we now transform (4) to obtain the difference $d$:

\begin{align*}
d=z\sqrt{\frac{N-n}{(N-1)n}p(1-p)}
\end{align*}

Assuming we estimate the true proportion with probability $0.95$ ($\alpha=0.05$) and taking the worst-case value $p=0.5$, we obtain the formula

\begin{align*}
d=1.96\sqrt{\frac{N-n}{(N-1)n}\cdot\frac{1}{2}\cdot\frac{1}{2}}=0.98\sqrt{\frac{N-n}{(N-1)n}}
\end{align*}



We observe in case 1: $N=1000, n=100$
\begin{align*}
d=0.98\sqrt{\frac{900}{999\cdot 100}}\simeq 0.0930
\end{align*}
and in case 2: $N=1000000, n=1000$
\begin{align*}
d=0.98\sqrt{\frac{999000}{999999\cdot 1000}}\simeq 0.0310
\end{align*}

and conclude that the accuracy of case 2 is greater than that of case 1, provided the interpretation is according to the modeling above.
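A quick numerical check of these two margins (a sketch assuming Python; the helper name is mine):

```python
from math import sqrt

def margin_d(N, n, z=1.96, p=0.5):
    """Worst-case margin d from the formula above (simple random sampling,
    finite population correction included)."""
    return z * sqrt((N - n) / ((N - 1) * n) * p * (1 - p))

print(round(margin_d(1_000, 100), 4))        # case 1: about 0.0930
print(round(margin_d(1_000_000, 1_000), 4))  # case 2: about 0.0310
print(round(margin_d(10**9, 1_000), 4))      # much larger population: barely changes
```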




Note: This answer is mostly based upon Sampling, Chapter 5: Estimating Proportions, Ratios, and Subpopulation Means, by Steven K. Thompson.




















      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      8





      +50







      $begingroup$

      Since you asked for graphs, I will make your example even a little more
      specific. Suppose that in the sample we take, $60%$ of the voters prefer Alice.
      For a sample size of $100,$ that means we had $60$ responses for Alice.



      So it looks like Alice is ahead of Bob, doesn't it?
      But how sure are we?
      To assess the accuracy of our poll,
      we can try setting up a one-sided $95%$ confidence interval
      by finding the smallest percentage of the population
      that might prefer Alice such that there is at least a $5%$ chance that
      we would have gotten at least $60$ responses favoring Alice out
      of $100$ total responses.



      If the population size were $100,$ we would have polled the entire
      population, and the confidence interval would start at $60%$
      (zero error).
      If the size of the population were just $105$,
      of whom only $61$ ($58.1%$) favored Alice, there would still be
      better than a $5%$ chance that we could get $60$ or more responses for Alice;
      but if fewer people favored Alice then there would be
      less than a $5%$ chance of $60$ favorable responses.
      So we set the $95%$ confidence interval at $58.1%$ or greater,
      which is a maximum error of $1.9%$ in the direction of overcounting
      (that is, our estimate could be that much higher than the true percentage).



      As the population size increases, the maximum error (in one direction)
      increases rapidly at first but soon starts to level off.
      The chart below shows errors for population sizes up to $3000$.



      enter image description here



      For a population of $1000$, our sample of $100$ (with $60$ favorable responses)
      has a (one-sided) error of $8.2%$ at the $95%$ confidence level.
      But the error grows very slowly after that.
      In fact it will never be larger than about $8.7%$,
      even for populations of a million or more.
      There is some effect of increasing the population size above
      $1000$ in this example, but it is not very dramatic.



      On the other hand, suppose we have a population of one million voters,
      take a sample of $k$ voters from this population, and find that
      $60%$ of them prefer Alice.
      If $k=100$ then we have a one-sided error of
      about $8.7%$ at the $95%$ confidence level,
      but if we make $k$ larger, the error decreases as shown in the chart below.



      enter image description here



      For $k=1000$ the error is only about $2.6%$.
      That is, while for a sample of $100$ it made little difference
      whether the population was $1000$ or one million,
      it makes quite a lot of difference whether the sample size is
      $100$ or $1000$.



      We could do a similar analysis with two-sided confidence intervals,
      where we look at possible errors in either direction.
      The results would tell much the same story:
      little degradation in accuracy for the $100$-voter sample as the
      population grows above $1000,$ but great improvement in accuracy
      when we increase the sample size.



      That's what your lecturer meant by saying the accuracy depends
      almost entirely on sample size
      when the sample size is small relative to the population.
      The only time the population size made much difference in our example
      (changing the bounds of the confidence interval by
      more than one percentage point)
      was when the sample was considerably more than $10%$ of the entire population.






      share|cite|improve this answer











      $endgroup$


















        8





        +50







        $begingroup$

        Since you asked for graphs, I will make your example even a little more
        specific. Suppose that in the sample we take, $60%$ of the voters prefer Alice.
        For a sample size of $100,$ that means we had $60$ responses for Alice.



        So it looks like Alice is ahead of Bob, doesn't it?
        But how sure are we?
        To assess the accuracy of our poll,
        we can try setting up a one-sided $95%$ confidence interval
        by finding the smallest percentage of the population
        that might prefer Alice such that there is at least a $5%$ chance that
        we would have gotten at least $60$ responses favoring Alice out
        of $100$ total responses.



        If the population size were $100,$ we would have polled the entire
        population, and the confidence interval would start at $60%$
        (zero error).
        If the size of the population were just $105$,
        of whom only $61$ ($58.1%$) favored Alice, there would still be
        better than a $5%$ chance that we could get $60$ or more responses for Alice;
        but if fewer people favored Alice then there would be
        less than a $5%$ chance of $60$ favorable responses.
        So we set the $95%$ confidence interval at $58.1%$ or greater,
        which is a maximum error of $1.9%$ in the direction of overcounting
        (that is, our estimate could be that much higher than the true percentage).



        As the population size increases, the maximum error (in one direction)
        increases rapidly at first but soon starts to level off.
        The chart below shows errors for population sizes up to $3000$.



        enter image description here



        For a population of $1000$, our sample of $100$ (with $60$ favorable responses)
        has a (one-sided) error of $8.2%$ at the $95%$ confidence level.
        But the error grows very slowly after that.
        In fact it will never be larger than about $8.7%$,
        even for populations of a million or more.
        There is some effect of increasing the population size above
        $1000$ in this example, but it is not very dramatic.



        On the other hand, suppose we have a population of one million voters,
        take a sample of $k$ voters from this population, and find that
        $60%$ of them prefer Alice.
        If $k=100$ then we have a one-sided error of
        about $8.7%$ at the $95%$ confidence level,
        but if we make $k$ larger, the error decreases as shown in the chart below.



        enter image description here



        For $k=1000$ the error is only about $2.6%$.
        That is, while for a sample of $100$ it made little difference
        whether the population was $1000$ or one million,
        it makes quite a lot of difference whether the sample size is
        $100$ or $1000$.



        We could do a similar analysis with two-sided confidence intervals,
        where we look at possible errors in either direction.
        The results would tell much the same story:
        little degradation in accuracy for the $100$-voter sample as the
        population grows above $1000,$ but great improvement in accuracy
        when we increase the sample size.



        That's what your lecturer meant by saying the accuracy depends
        almost entirely on sample size
        when the sample size is small relative to the population.
        The only time the population size made much difference in our example
        (changing the bounds of the confidence interval by
        more than one percentage point)
        was when the sample was considerably more than $10%$ of the entire population.






        share|cite|improve this answer











        $endgroup$
















          8





          +50







          8





          +50



          8




          +50



          $begingroup$

          Since you asked for graphs, I will make your example even a little more
          specific. Suppose that in the sample we take, $60%$ of the voters prefer Alice.
          For a sample size of $100,$ that means we had $60$ responses for Alice.



          So it looks like Alice is ahead of Bob, doesn't it?
          But how sure are we?
          To assess the accuracy of our poll,
          we can try setting up a one-sided $95%$ confidence interval
          by finding the smallest percentage of the population
          that might prefer Alice such that there is at least a $5%$ chance that
          we would have gotten at least $60$ responses favoring Alice out
          of $100$ total responses.



          If the population size were $100,$ we would have polled the entire
          population, and the confidence interval would start at $60%$
          (zero error).
          If the size of the population were just $105$,
          of whom only $61$ ($58.1%$) favored Alice, there would still be
          better than a $5%$ chance that we could get $60$ or more responses for Alice;
          but if fewer people favored Alice then there would be
          less than a $5%$ chance of $60$ favorable responses.
          So we set the $95%$ confidence interval at $58.1%$ or greater,
          which is a maximum error of $1.9%$ in the direction of overcounting
          (that is, our estimate could be that much higher than the true percentage).



          As the population size increases, the maximum error (in one direction)
          increases rapidly at first but soon starts to level off.
          The chart below shows errors for population sizes up to $3000$.



          enter image description here



          For a population of $1000$, our sample of $100$ (with $60$ favorable responses)
          has a (one-sided) error of $8.2%$ at the $95%$ confidence level.
          But the error grows very slowly after that.
          In fact it will never be larger than about $8.7%$,
          even for populations of a million or more.
          There is some effect of increasing the population size above
          $1000$ in this example, but it is not very dramatic.



          On the other hand, suppose we have a population of one million voters,
          take a sample of $k$ voters from this population, and find that
          $60%$ of them prefer Alice.
          If $k=100$ then we have a one-sided error of
          about $8.7%$ at the $95%$ confidence level,
          but if we make $k$ larger, the error decreases as shown in the chart below.



          enter image description here



          For $k=1000$ the error is only about $2.6%$.
          That is, while for a sample of $100$ it made little difference
          whether the population was $1000$ or one million,
          it makes quite a lot of difference whether the sample size is
          $100$ or $1000$.



          We could do a similar analysis with two-sided confidence intervals,
          where we look at possible errors in either direction.
          The results would tell much the same story:
          little degradation in accuracy for the $100$-voter sample as the
          population grows above $1000,$ but great improvement in accuracy
          when we increase the sample size.



          That's what your lecturer meant by saying the accuracy depends
          almost entirely on sample size
          when the sample size is small relative to the population.
          The only time the population size made much difference in our example
          (changing the bounds of the confidence interval by
          more than one percentage point)
          was when the sample was considerably more than $10%$ of the entire population.






          share|cite|improve this answer











          $endgroup$



          Since you asked for graphs, I will make your example even a little more
          specific. Suppose that in the sample we take, $60%$ of the voters prefer Alice.
          For a sample size of $100,$ that means we had $60$ responses for Alice.



          So it looks like Alice is ahead of Bob, doesn't it?
          But how sure are we?
          To assess the accuracy of our poll,
          we can try setting up a one-sided $95%$ confidence interval
          by finding the smallest percentage of the population
          that might prefer Alice such that there is at least a $5%$ chance that
          we would have gotten at least $60$ responses favoring Alice out
          of $100$ total responses.



          If the population size were $100,$ we would have polled the entire
          population, and the confidence interval would start at $60%$
          (zero error).
          If the size of the population were just $105$,
          of whom only $61$ ($58.1%$) favored Alice, there would still be
          better than a $5%$ chance that we could get $60$ or more responses for Alice;
          but if fewer people favored Alice then there would be
          less than a $5%$ chance of $60$ favorable responses.
          So we set the $95%$ confidence interval at $58.1%$ or greater,
          which is a maximum error of $1.9%$ in the direction of overcounting
          (that is, our estimate could be that much higher than the true percentage).



          As the population size increases, the maximum error (in one direction)
          increases rapidly at first but soon starts to level off.
          The chart below shows errors for population sizes up to $3000$.



          enter image description here



          For a population of $1000$, our sample of $100$ (with $60$ favorable responses)
          has a (one-sided) error of $8.2%$ at the $95%$ confidence level.
          But the error grows very slowly after that.
          In fact it will never be larger than about $8.7%$,
          even for populations of a million or more.
          There is some effect of increasing the population size above
          $1000$ in this example, but it is not very dramatic.



          On the other hand, suppose we have a population of one million voters,
          take a sample of $k$ voters from this population, and find that
          $60%$ of them prefer Alice.
          If $k=100$ then we have a one-sided error of
          about $8.7%$ at the $95%$ confidence level,
          but if we make $k$ larger, the error decreases as shown in the chart below.



          enter image description here



          For $k=1000$ the error is only about $2.6%$.
          That is, while for a sample of $100$ it made little difference
          whether the population was $1000$ or one million,
          it makes quite a lot of difference whether the sample size is
          $100$ or $1000$.



          We could do a similar analysis with two-sided confidence intervals,
          where we look at possible errors in either direction.
          The results would tell much the same story:
          little degradation in accuracy for the $100$-voter sample as the
          population grows above $1000,$ but great improvement in accuracy
          when we increase the sample size.



          That's what your lecturer meant by saying the accuracy depends
          almost entirely on sample size
          when the sample size is small relative to the population.
          The only time the population size made much difference in our example
          (changing the bounds of the confidence interval by
          more than one percentage point)
          was when the sample was considerably more than $10%$ of the entire population.







          share|cite|improve this answer














          share|cite|improve this answer



          share|cite|improve this answer








          edited Jul 16 '15 at 1:22

























          answered Jul 12 '15 at 3:16









          David KDavid K

          53.9k342116




          53.9k342116























              7












              $begingroup$

              Note: For convenience only I use in the following $N$ for the size of the population and $n$ for the sample size.



              In order to answer OPs questions we start with some preliminary work and describe the current situation in somewhat more detail.




              Current situation:



              Here we have a simple random sampling, meaning that every possible combination of $n$ units from a population of size $N$ is equally likely to be the sample selected.



              We are in a sampling situation where the object is to estimate the proportion of units in a population having some attributes. In such a situation, the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.



              Writing $p$ for the proportion in the population of size $N$ with the attribute
              begin{align*}
              p=frac{1}{N}sum_{i=1}^{N}y_i=mu
              end{align*}
              the finite population variance is
              begin{align*}
              sigma^2&=frac{sum_{i=1}^{N-1}(y_i-p)^2}{N-1}=frac{sum_{i=1}^{N-1}y_i^2-Np^2}{N-1}
              =frac{Np-Np^2}{N-1}\
              &=frac{N}{N-1}p(1-p)
              end{align*}
              Now letting $hat{p}$ denote the proportion in the sample of size $n$ with the attribute
              begin{align*}
              hat{p}=frac{1}{n}sum_{i=1}^n{y_i}=bar{y}
              end{align*}
              the sample variance is
              begin{align*}
              s^2&=frac{sum_{i=1}^{n-1}(y_i-bar{y})^2}{n-1}=frac{sum_{i=1}^{n-1}y_i^2-nhat{p}^2}{n-1}\
              &=frac{n}{n-1}hat{p}(1-hat{p})\
              end{align*}



              Note the sample proportion is the sample mean of a simple random sample, it is unbiased for the population proportion and has variance
              begin{align*}
              mathop{var}(hat{p})=frac{N-n}{N-1}cdotfrac{p(1-p)}{n}tag{1}
              end{align*}




              Before we can answer OPs questions we have to do some general




              Considerations regarding accuracy:



              Suppose that one wishes to estimate a population parameter $theta$ - for example the population mean or total or proportion of an attribute of the units of the population with an estimator $hat{theta}$. Then we would wish the estimate to be close to the true value with high probability.



              So, specifying a maximum allowable difference $d$ between the estimate and the true value and allowing for a small probability $alpha$ that the error may exceed that difference, the challenge is to choose a sample size $n$ such that
              begin{align*}
              P(|hat{theta}-theta|>d)<alphatag{2}
              end{align*}
              If the estimator $hat{theta}$ is an unbiased, normally distributed estimator of $theta$, then $frac{hat{theta}-theta}{sqrt{mathop{var}(hat{theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $frac{alpha}{2}$ point of the standard normal distribution yields
              begin{align*}
              Pleft(frac{|hat{theta}-theta|}{sqrt{mathop{var}(hat{theta})}}>zright)
              =Pleft(|hat{theta}-theta|>zsqrt{mathop{var}(hat{theta})}right)=alpha
              end{align*}




              Now, since $d$ and the expression (2) provide us with a precise idea of accuracy, we are ready to harvest.



              Observe, that the variance of the estimator $hat{theta}$ decreases with an increasing sample size $n$, so that the inequality above will be satisfied if we can choose $n$ large enough to make
              begin{align*}
              zsqrt{mathop{var}(hat{theta})}leq dtag{3}
              end{align*}



              These are the relevant parameters to deal with accuracy. Next we consider




              Sample size $n$ for estimating a proportion:



              To obtain an estimator $hat{p}$ having probability at least $1-alpha$ of being no farther then $d$ from the population proportion, the sample size formula based on the normal approximation gives according to (1) and (3)
              begin{align*}
              mathop{var}(hat{theta})&=frac{d^2}{z^2}\
              frac{N-n}{N-1}cdotfrac{p(1-p)}{n}&=frac{d^2}{z^2}\
              end{align*}
              We obtain by setting $n_0=frac{z^2}{d^2}p(1-p)$
              begin{align*}
              n=frac{1}{frac{N-1}{N}cdotfrac{1}{n_0}+frac{1}{N}}tag{3}
              end{align*}



              Note that the formula depends on the unknown population proportion $p$. Since no estimate of $p$ is available, a worst-case value of $p=frac{1}{2}$ can be used in determining the sample size. This approach is justified since the quantity $p(1-p)$, and hence the value of $n$ assumes its maximum value when $p=frac{1}{2}$.




              Note: When $N$ is large compared with the sample size $n$ then formula (3) reduces to



              begin{align*}
              n&simeq limlimits_{Nrightarrow infty}frac{1}{frac{N-1}{N}cdotfrac{1}{n_0}+frac{1}{N}}=n_0
              end{align*}
              Since then $n=n_0$ we obtain
              begin{align*}
              n=frac{z^2}{d^2}p(1-p)tag{4}
              end{align*}
              and we see in accordance with OPs lecturer, that in case the sample size $n$ is small compared with the population size the accuracy $d$ depends on the sample only.



              With regard to one of OPs questions I'm not aware of a specific term for this circumstance. But, sometimes this is named finite population correction.




              Which scenario is more accurate:



              To answer this question we now transform (3) to obtain the difference $d$



              begin{align*}
              d=zsqrt{frac{N-n}{(N-1)n}p(1-p)}
              end{align*}



              Assuming an estimation for the true proportion with probability $0.95$ ($alpha=0.05$) and taking the worst-case probability $p=0.5$ we obtain the formula



              begin{align*}
              d=1.96sqrt{frac{N-n}{(N-1)n}cdotfrac{1}{2}cdotfrac{1}{2}}=0.98sqrt{frac{N-n}{(N-1)n}}
              end{align*}



              We observe in case 1: $N=1000, n=100$
              begin{align*}
              d=0.98sqrt{frac{900}{999cdot100}}simeq 0.0930
              end{align*}
              and in case 2: $N=100000, n=1000$
              begin{align*}
              d=0.98sqrt{frac{999000}{999999cdot1090}}simeq 0.0310
              end{align*}



              and conclude, that the accuracy of case 2 is greater than that of case 1 provided the interpretation is according to the modeling above.




              Note: This answer is mostly based upon Sampling, chapter 5: Estimating Proportions, Ratios and Subpopulation Means by Steven K. Thompson.






              share|cite|improve this answer











              $endgroup$


















                7












                $begingroup$

                Note: For convenience only I use in the following $N$ for the size of the population and $n$ for the sample size.



                In order to answer OPs questions we start with some preliminary work and describe the current situation in somewhat more detail.




                Current situation:



                Here we have a simple random sampling, meaning that every possible combination of $n$ units from a population of size $N$ is equally likely to be the sample selected.



                We are in a sampling situation where the object is to estimate the proportion of units in a population having some attributes. In such a situation, the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.



                Writing $p$ for the proportion in the population of size $N$ with the attribute
                begin{align*}
                p=frac{1}{N}sum_{i=1}^{N}y_i=mu
                end{align*}
                the finite population variance is
                begin{align*}
                sigma^2&=frac{sum_{i=1}^{N-1}(y_i-p)^2}{N-1}=frac{sum_{i=1}^{N-1}y_i^2-Np^2}{N-1}
                =frac{Np-Np^2}{N-1}\
                &=frac{N}{N-1}p(1-p)
                end{align*}
                Now letting $hat{p}$ denote the proportion in the sample of size $n$ with the attribute
                begin{align*}
                hat{p}=frac{1}{n}sum_{i=1}^n{y_i}=bar{y}
                end{align*}
                the sample variance is
                begin{align*}
                s^2&=frac{sum_{i=1}^{n-1}(y_i-bar{y})^2}{n-1}=frac{sum_{i=1}^{n-1}y_i^2-nhat{p}^2}{n-1}\
                &=frac{n}{n-1}hat{p}(1-hat{p})\
                end{align*}



                Note the sample proportion is the sample mean of a simple random sample, it is unbiased for the population proportion and has variance
                begin{align*}
                mathop{var}(hat{p})=frac{N-n}{N-1}cdotfrac{p(1-p)}{n}tag{1}
                end{align*}




                Before we can answer OPs questions we have to do some general




                Considerations regarding accuracy:



                Suppose that one wishes to estimate a population parameter $theta$ - for example the population mean or total or proportion of an attribute of the units of the population with an estimator $hat{theta}$. Then we would wish the estimate to be close to the true value with high probability.



                So, specifying a maximum allowable difference $d$ between the estimate and the true value and allowing for a small probability $alpha$ that the error may exceed that difference, the challenge is to choose a sample size $n$ such that
                begin{align*}
                P(|hat{theta}-theta|>d)<alphatag{2}
                end{align*}
                If the estimator $hat{theta}$ is an unbiased, normally distributed estimator of $theta$, then $frac{hat{theta}-theta}{sqrt{mathop{var}(hat{theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $frac{alpha}{2}$ point of the standard normal distribution yields
                begin{align*}
                Pleft(frac{|hat{theta}-theta|}{sqrt{mathop{var}(hat{theta})}}>zright)
                =Pleft(|hat{theta}-theta|>zsqrt{mathop{var}(hat{theta})}right)=alpha
                end{align*}




                Now, since $d$ and the expression (2) provide us with a precise idea of accuracy, we are ready to harvest.



                Observe, that the variance of the estimator $hat{theta}$ decreases with an increasing sample size $n$, so that the inequality above will be satisfied if we can choose $n$ large enough to make
                begin{align*}
                zsqrt{mathop{var}(hat{theta})}leq dtag{3}
                end{align*}



                These are the relevant parameters to deal with accuracy. Next we consider




                Sample size $n$ for estimating a proportion:



                To obtain an estimator $hat{p}$ having probability at least $1-alpha$ of being no farther then $d$ from the population proportion, the sample size formula based on the normal approximation gives according to (1) and (3)
                begin{align*}
                mathop{var}(hat{theta})&=frac{d^2}{z^2}\
                frac{N-n}{N-1}cdotfrac{p(1-p)}{n}&=frac{d^2}{z^2}\
                end{align*}
                We obtain by setting $n_0=frac{z^2}{d^2}p(1-p)$
                begin{align*}
                n=frac{1}{frac{N-1}{N}cdotfrac{1}{n_0}+frac{1}{N}}tag{3}
                end{align*}



                Note that the formula depends on the unknown population proportion $p$. Since no estimate of $p$ is available, a worst-case value of $p=frac{1}{2}$ can be used in determining the sample size. This approach is justified since the quantity $p(1-p)$, and hence the value of $n$ assumes its maximum value when $p=frac{1}{2}$.




                Note: When $N$ is large compared with the sample size $n$ then formula (3) reduces to



                begin{align*}
                n&simeq limlimits_{Nrightarrow infty}frac{1}{frac{N-1}{N}cdotfrac{1}{n_0}+frac{1}{N}}=n_0
                end{align*}
                Since then $n=n_0$ we obtain
                begin{align*}
                n=frac{z^2}{d^2}p(1-p)tag{4}
                end{align*}
                and we see in accordance with OPs lecturer, that in case the sample size $n$ is small compared with the population size the accuracy $d$ depends on the sample only.



                With regard to one of OPs questions I'm not aware of a specific term for this circumstance. But, sometimes this is named finite population correction.




                Which scenario is more accurate:



                To answer this question we now transform (3) to obtain the difference $d$



                begin{align*}
                d=zsqrt{frac{N-n}{(N-1)n}p(1-p)}
                end{align*}



                Assuming an estimation for the true proportion with probability $0.95$ ($alpha=0.05$) and taking the worst-case probability $p=0.5$ we obtain the formula



                begin{align*}
                d=1.96sqrt{frac{N-n}{(N-1)n}cdotfrac{1}{2}cdotfrac{1}{2}}=0.98sqrt{frac{N-n}{(N-1)n}}
                end{align*}



                We observe in case 1: $N=1000, n=100$
                begin{align*}
                d=0.98sqrt{frac{900}{999cdot100}}simeq 0.0930
                end{align*}
                and in case 2: $N=100000, n=1000$
                begin{align*}
                d=0.98sqrt{frac{999000}{999999cdot1090}}simeq 0.0310
                end{align*}



                and conclude, that the accuracy of case 2 is greater than that of case 1 provided the interpretation is according to the modeling above.




                Note: This answer is mostly based upon Sampling, chapter 5: Estimating Proportions, Ratios and Subpopulation Means by Steven K. Thompson.






                share|cite|improve this answer











                $endgroup$
















                  7












                  7








                  7





                  $begingroup$

                  Note: For convenience only I use in the following $N$ for the size of the population and $n$ for the sample size.



                  In order to answer OPs questions we start with some preliminary work and describe the current situation in somewhat more detail.




                  Current situation:



                  Here we have a simple random sampling, meaning that every possible combination of $n$ units from a population of size $N$ is equally likely to be the sample selected.



                  We are in a sampling situation where the object is to estimate the proportion of units in a population having some attributes. In such a situation, the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.



                  Writing $p$ for the proportion in the population of size $N$ with the attribute
                  begin{align*}
                  p=frac{1}{N}sum_{i=1}^{N}y_i=mu
                  end{align*}
                  the finite population variance is
                  begin{align*}
                  sigma^2&=frac{sum_{i=1}^{N-1}(y_i-p)^2}{N-1}=frac{sum_{i=1}^{N-1}y_i^2-Np^2}{N-1}
                  =frac{Np-Np^2}{N-1}\
                  &=frac{N}{N-1}p(1-p)
                  end{align*}
                  Now letting $hat{p}$ denote the proportion in the sample of size $n$ with the attribute
                  begin{align*}
                  hat{p}=frac{1}{n}sum_{i=1}^n{y_i}=bar{y}
                  end{align*}
                  the sample variance is
                  begin{align*}
                  s^2&=frac{sum_{i=1}^{n-1}(y_i-bar{y})^2}{n-1}=frac{sum_{i=1}^{n-1}y_i^2-nhat{p}^2}{n-1}\
                  &=frac{n}{n-1}hat{p}(1-hat{p})\
                  end{align*}



                  Note the sample proportion is the sample mean of a simple random sample, it is unbiased for the population proportion and has variance
                  begin{align*}
                  mathop{var}(hat{p})=frac{N-n}{N-1}cdotfrac{p(1-p)}{n}tag{1}
                  end{align*}




                  Before we can answer OPs questions we have to do some general




                  Considerations regarding accuracy:



                  Suppose that one wishes to estimate a population parameter $theta$ - for example the population mean or total or proportion of an attribute of the units of the population with an estimator $hat{theta}$. Then we would wish the estimate to be close to the true value with high probability.



                  So, specifying a maximum allowable difference $d$ between the estimate and the true value and allowing for a small probability $alpha$ that the error may exceed that difference, the challenge is to choose a sample size $n$ such that
                  begin{align*}
                  P(|hat{theta}-theta|>d)<alphatag{2}
                  end{align*}
                  If the estimator $hat{theta}$ is an unbiased, normally distributed estimator of $theta$, then $frac{hat{theta}-theta}{sqrt{mathop{var}(hat{theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $frac{alpha}{2}$ point of the standard normal distribution yields
                  begin{align*}
                  Pleft(frac{|hat{theta}-theta|}{sqrt{mathop{var}(hat{theta})}}>zright)
                  =Pleft(|hat{theta}-theta|>zsqrt{mathop{var}(hat{theta})}right)=alpha
                  end{align*}




                  Now, since $d$ and the expression (2) provide us with a precise idea of accuracy, we are ready to harvest.



Note: For convenience I will use $N$ for the population size and $n$ for the sample size throughout.



In order to answer the OP's questions we start with some preliminary work and describe the situation in somewhat more detail.




                  Current situation:



Here we have simple random sampling, meaning that every possible combination of $n$ units from the population of size $N$ is equally likely to be the sample selected.



We are in a sampling situation where the object is to estimate the proportion of units in a population having some attribute. In such a situation the variable of interest is an indicator variable: $y_i=1$ if unit $i$ has the attribute, and $y_i=0$ if it does not.



Writing $p$ for the proportion in the population of size $N$ with the attribute
\begin{align*}
p=\frac{1}{N}\sum_{i=1}^{N}y_i=\mu
\end{align*}
the finite population variance is
\begin{align*}
\sigma^2&=\frac{\sum_{i=1}^{N}(y_i-p)^2}{N-1}=\frac{\sum_{i=1}^{N}y_i^2-Np^2}{N-1}
=\frac{Np-Np^2}{N-1}\\
&=\frac{N}{N-1}p(1-p)
\end{align*}
where we used $y_i^2=y_i$, since $y_i\in\{0,1\}$, so that $\sum_{i=1}^{N}y_i^2=Np$.
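As a quick sanity check of this identity, here is a tiny Python sketch of mine (not part of Thompson's text): it builds a 0/1 population with a given proportion and compares the directly computed finite population variance with $\frac{N}{N-1}p(1-p)$.

    # Check: the finite population variance of a 0/1 population
    # equals N/(N-1) * p * (1 - p).
    N = 1000
    ones = 370                      # number of units with the attribute (arbitrary)
    y = [1] * ones + [0] * (N - ones)

    p = sum(y) / N                  # population proportion
    sigma2_direct = sum((yi - p) ** 2 for yi in y) / (N - 1)
    sigma2_formula = N / (N - 1) * p * (1 - p)

    print(sigma2_direct, sigma2_formula)   # both ~0.2333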
Now letting $\hat{p}$ denote the proportion in the sample of size $n$ with the attribute
\begin{align*}
\hat{p}=\frac{1}{n}\sum_{i=1}^n y_i=\bar{y}
\end{align*}
the sample variance is
\begin{align*}
s^2&=\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}=\frac{\sum_{i=1}^{n}y_i^2-n\hat{p}^2}{n-1}\\
&=\frac{n}{n-1}\hat{p}(1-\hat{p})
\end{align*}



Note that the sample proportion is the sample mean of a simple random sample; it is unbiased for the population proportion and has variance
\begin{align*}
\mathop{var}(\hat{p})=\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}\tag{1}
\end{align*}
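Formula (1) can also be checked empirically. The following Python sketch (again my own illustration) repeatedly draws simple random samples without replacement from a 0/1 population and compares the empirical variance of $\hat{p}$ with (1):

    # Monte Carlo check of var(p-hat) = (N - n)/(N - 1) * p(1 - p)/n
    # under simple random sampling (without replacement).
    import random
    from statistics import pvariance

    N, n, ones = 1000, 100, 370            # illustrative values
    population = [1] * ones + [0] * (N - ones)
    p = ones / N

    estimates = []
    for _ in range(20_000):
        sample = random.sample(population, n)   # sampling without replacement
        estimates.append(sum(sample) / n)

    print(pvariance(estimates))                   # empirical variance of p-hat
    print((N - n) / (N - 1) * p * (1 - p) / n)    # formula (1), ~0.0021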




Before we can answer the OP's questions we need some general




                  Considerations regarding accuracy:



Suppose that one wishes to estimate a population parameter $\theta$ (for example the population mean, total, or the proportion of units with a given attribute) with an estimator $\hat{\theta}$. Then we would like the estimate to be close to the true value with high probability.



So, specifying a maximum allowable difference $d$ between the estimate and the true value, and allowing a small probability $\alpha$ that the error exceeds that difference, the challenge is to choose a sample size $n$ such that
\begin{align*}
P(|\hat{\theta}-\theta|>d)<\alpha\tag{2}
\end{align*}
If $\hat{\theta}$ is an unbiased, normally distributed estimator of $\theta$, then $\frac{\hat{\theta}-\theta}{\sqrt{\mathop{var}(\hat{\theta})}}$ has a standard normal distribution. Letting $z$ denote the upper $\frac{\alpha}{2}$ point of the standard normal distribution yields
\begin{align*}
P\left(\frac{|\hat{\theta}-\theta|}{\sqrt{\mathop{var}(\hat{\theta})}}>z\right)
=P\left(|\hat{\theta}-\theta|>z\sqrt{\mathop{var}(\hat{\theta})}\right)=\alpha
\end{align*}




Now, since $d$ and expression (2) give us a precise notion of accuracy, we are ready to harvest.



Observe that the variance of the estimator $\hat{\theta}$ decreases with increasing sample size $n$, so the inequality above will be satisfied if we can choose $n$ large enough to make
\begin{align*}
z\sqrt{\mathop{var}(\hat{\theta})}\leq d\tag{3}
\end{align*}



These are the relevant quantities for dealing with accuracy. Next we consider




                  Sample size $n$ for estimating a proportion:



To obtain an estimator $\hat{p}$ having probability at least $1-\alpha$ of being no farther than $d$ from the population proportion, we set the variance equal to the largest value allowed by (3) and use (1):
\begin{align*}
\mathop{var}(\hat{p})&=\frac{d^2}{z^2}\\
\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}&=\frac{d^2}{z^2}
\end{align*}
Setting $n_0=\frac{z^2}{d^2}p(1-p)$ and solving for $n$ we obtain
\begin{align*}
n=\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}\tag{4}
\end{align*}



Note that the formula depends on the unknown population proportion $p$. If no estimate of $p$ is available, the worst-case value $p=\frac{1}{2}$ can be used in determining the sample size. This is justified since the quantity $p(1-p)$, and hence the required $n$, attains its maximum at $p=\frac{1}{2}$.
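Putting (4) and the worst-case choice $p=\frac{1}{2}$ together, a sample-size calculator can be sketched in a few lines of Python. This is only an illustration of mine (the function name and the example values are made up), not code from Thompson's book:

    # Required sample size for estimating a proportion to within +-d
    # with confidence 1-alpha, including the finite population correction.
    from math import ceil
    from statistics import NormalDist

    def sample_size(N, d, alpha=0.05, p=0.5):
        z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of the normal
        n0 = (z / d) ** 2 * p * (1 - p)           # infinite-population sample size
        n = 1 / ((N - 1) / (N * n0) + 1 / N)      # formula (4)
        return ceil(n)

    print(sample_size(N=1_000, d=0.05))       # ~278: the small population helps
    print(sample_size(N=1_000_000, d=0.05))   # ~384: essentially n0 (~385)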




Note: When $N$ is large compared with the sample size $n$, formula (4) reduces to



\begin{align*}
n&\simeq \lim_{N\rightarrow \infty}\frac{1}{\frac{N-1}{N}\cdot\frac{1}{n_0}+\frac{1}{N}}=n_0
\end{align*}
Since then $n=n_0$ we obtain
\begin{align*}
n=\frac{z^2}{d^2}p(1-p)\tag{5}
\end{align*}
and we see, in accordance with the OP's lecturer, that when the sample size $n$ is small compared with the population size, the accuracy $d$ depends essentially on the sample size alone.
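To see how weak the dependence on $N$ actually is, we can tabulate (4) for a fixed target accuracy across a range of population sizes. Again just a small illustrative Python sketch, with arbitrarily chosen $d=0.03$ and $\alpha=0.05$:

    # Required n from formula (4) for d = 0.03, alpha = 0.05, worst case p = 0.5,
    # as the population size N grows.
    from math import ceil
    from statistics import NormalDist

    z = NormalDist().inv_cdf(0.975)
    n0 = (z / 0.03) ** 2 * 0.25
    for N in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
        n = 1 / ((N - 1) / (N * n0) + 1 / N)
        print(N, ceil(n))
    # 1,000 -> 517;  10,000 -> 965;  100,000 -> 1056;
    # 1,000,000 -> 1066;  10,000,000 -> 1067  (n0 is about 1067)

Once the population is a few hundred times larger than the sample, the required $n$ is essentially constant.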



With regard to one of the OP's questions: I'm not aware of a specific name for this result, but the factor $\frac{N-n}{N-1}$ in (1) is often called the finite population correction.




                  Which scenario is more accurate:



To answer this question we use (1) and (3) to express the accuracy $d$ in terms of $n$ and $N$:



\begin{align*}
d=z\sqrt{\frac{N-n}{(N-1)n}p(1-p)}
\end{align*}



Assuming we want the estimate to be within $d$ of the true proportion with probability $0.95$ ($\alpha=0.05$, so $z=1.96$), and taking the worst-case value $p=0.5$, we obtain the formula



\begin{align*}
d=1.96\sqrt{\frac{N-n}{(N-1)n}\cdot\frac{1}{2}\cdot\frac{1}{2}}=0.98\sqrt{\frac{N-n}{(N-1)n}}
\end{align*}



We observe in case 1: $N=1000$, $n=100$
\begin{align*}
d=0.98\sqrt{\frac{900}{999\cdot100}}\simeq 0.0930
\end{align*}
and in case 2: $N=1000000$, $n=1000$
\begin{align*}
d=0.98\sqrt{\frac{999000}{999999\cdot1000}}\simeq 0.0310
\end{align*}



and conclude that case 2 is more accurate than case 1 (its maximum error $d$ is about three times smaller), at least under the model described above.
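The two margins of error are easy to reproduce, and plotting $d$ against $n$ for the two population sizes makes the effect visible: for sample sizes that are small relative to $N$ the curves nearly coincide. A possible sketch (matplotlib is assumed to be available; everything here is my own illustration):

    # Margin of error d = 0.98 * sqrt((N - n) / ((N - 1) * n)) for the two scenarios,
    # plus a plot of d versus n for both population sizes.
    from math import sqrt
    import matplotlib.pyplot as plt

    def margin(N, n):
        return 0.98 * sqrt((N - n) / ((N - 1) * n))

    print(margin(1_000, 100))        # ~0.0930 (case 1)
    print(margin(1_000_000, 1_000))  # ~0.0310 (case 2)

    ns = range(10, 1001, 10)
    plt.plot(ns, [margin(1_000, n) for n in ns], label="N = 1,000")
    plt.plot(ns, [margin(1_000_000, n) for n in ns], label="N = 1,000,000")
    plt.xlabel("sample size n")
    plt.ylabel("margin of error d (95%, worst case p = 0.5)")
    plt.legend()
    plt.show()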




Note: This answer is mostly based upon Sampling by Steven K. Thompson, Chapter 5: Estimating Proportions, Ratios, and Subpopulation Means.











                  edited Jul 11 '15 at 17:27

























                  answered Jul 11 '15 at 16:27









Markus Scheuer






























