---
title: "Sampling"
---
```{r options_communes, include=FALSE}
source("options_communes.R")
```
<div class="important">
The sampling strategy is constrained by the available budget, field accessibility and time. The sampling approach reflects a trade-off between representativeness of the results, rapid delivery and cost.
</div>
### Declaration Bias
## Sampling strategy
Sampling strategies can be either probabilistic or non-probabilistic. A good introduction can be found [here](http://www.fao.org/docrep/W3241E/w3241e08.htm).
### Non-probabilistic approaches
Non-probabilistic approaches are usually **favored during the emergency phase**, when both time and field access are the main challenges.
#### Convenience sampling
It is a frequently used method in emergency situations. It relies on sampling those respondents who are easiest to access.
Practically speaking, these could be either:
* Key informants who are ready to report by themselves
* Individuals or households among those who have settled along roadsides, or who present themselves to the administrative centre of the returnee settlement, the assistance desk, etc.
* Advantages: Easy and quick to implement, especially when time and access are the main constraints.
* Disadvantage: The danger with this type of data collection approach is that it often leads to biased results, as the sample may not be representative of the majority, since those with the most resources or power are often the ones who settle in the most easily accessible areas.
#### Snowball sampling
Snowball sampling (or [chain sampling, chain-referral sampling, referral sampling](https://en.wikipedia.org/wiki/Snowball_sampling)) is a non-probability sampling technique where existing study subjects recruit future subjects from among their acquaintances. This technique is subject to numerous biases. For example, people who have many friends are more likely to be recruited into the sample.
* Advantages: Useful when targeting specific groups that might be difficult to reach (hidden population).
* Disadvantage: This approach might underweight the most vulnerable individuals.
#### Purposive sampling
It is based on previous knowledge about who might be able to provide valuable or specific information. It uses the judgement of community representatives, project staff or assessors to select typical locations and/or informants. The sampling of children or women, for example, is a type of purposive sampling.
Purposive sampling can also be done through key informants.
* Advantages: Moderately rigorous if clear and well-defined sampling criteria are followed. Useful when targeting specific groups of the affected population or specific affected areas. Less time consuming and less expensive than representative sampling.
* Disadvantage: Generalisations are biased and not recommended. Samples are not representative of the population due to the subjectivity of the selection.
The risk of losing certain components of the population can be addressed by defining strata within the purposive sample.
In the case of desk interviews or key informants, the more observations the better. Some kind of [credibility scoring](http://iomiraqdtm.info/Downloads/00-%20DTM%20Methodology%20Documents/DTM_LA_Credibility_Scoring_Methodology.pdf) can be obtained for each location based on a review of the key informants.
#### Quota sample
A quota sample might be representative of the population (if the quotas actually work, which they do not always do). But a quota sample will never satisfy the strict randomness requirements that statistics require.
Only if we are working with a random sample can we make inferences from the sample to the population. In quota samples, there is not sufficient randomness, as the interviewer selects the interviewees actively.
Therefore, quota samples cannot be used to reason about the general population.
### Probabilistic approaches
Whenever the situation becomes more **protracted**, probabilistic approaches should be favored, as they generate more reliable results.
#### Respondent-Driven Sampling (RDS)
A variant of snowball sampling is the [Respondent-Driven Sampling (RDS)](http://www.respondentdrivensampling.org/) approach. It combines snowball sampling with a mathematical model that weights the sample to compensate for the fact that the sample was collected in a non-random way. As such it can be classified as a probabilistic approach. The advantage is that seed selection is specific and does not require a sampling frame.
While data requirements for RDS analysis are minimal, there are three pieces of information which are essential for analysis (RDS analysis CANNOT BE PERFORMED without these fields for each respondent):
* Personal Network Size (Degree) - Number of people the respondent knows within the target population.
* Respondent's Serial Number - Serial number of the coupon the respondent was recruited with.
* Respondent's Recruiting Serial Numbers - Serial numbers from the coupons the respondent is given to recruit others.
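As an illustration, a minimal sketch with the `RDS` package for R using these three fields; the data frame `rds_df`, its values and the outcome variable `return_intention` are hypothetical:
```{r rds_sketch, eval=FALSE}
# install.packages("RDS")
library(RDS)

# Hypothetical recruitment data: one row per respondent with the three
# essential fields listed above, plus an outcome of interest.
rds_df <- data.frame(
  id               = c("S1", "R1", "R2", "R3"),    # respondent's serial number
  recruiter.id     = c("seed", "S1", "S1", "R1"),  # serial number of the recruiter's coupon
  network.size     = c(12, 8, 20, 5),              # personal network size (degree)
  return_intention = c("yes", "no", "yes", "no")
)

# Declare the recruitment structure so that RDS estimators can weight the sample
rds_data <- as.rds.data.frame(rds_df,
                              id = "id",
                              recruiter.id = "recruiter.id",
                              network.size = "network.size")

# RDS-II estimate of the outcome, weighted by the reported network sizes
RDS.II.estimates(rds.data = rds_data, outcome.variable = "return_intention")
```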
#### Time-Location Sampling
The Time-Location Sampling (TLS) approach can be used when the goal is to have a representation of a population in movement. The idea is to sample persons at the locations, and at the times, at which they may be found.
Time-location sampling is used to sample a population for which a sampling frame cannot be constructed but locations are known at which the population of interest can be found, or for which it is more efficient to sample at these locations. As such the approach is likely appropriate when the survey is taking place at a **border**.
More practical guidelines for TLS are available in a dedicated [Resource Guide TLS](http://globalhealthsciences.ucsf.edu/sites/default/files/content/pphg/surveillance/modules/global-trainings/tls-res-guide-2nd-edition.pdf) and some application on Border Monitoring for [tourism](http://meetings.sis-statistica.org/index.php/sm/sm2012/paper/viewFile/2180/149) or [illegal migrants](https://books.google.jo/books?id=Gz9eAgAAQBAJ&pg=PA53&lpg=PA53&dq=Border+surveys+and+Time+Location+Sampling&source=bl&ots=6i5IgC-2Mb&sig=P3CdG8-LvC0Y_LCK-MZ047gAJNQ&hl=en&sa=X&redir_esc=y#v=onepage&q=Border%20surveys%20and%20Time%20Location%20Sampling&f=false).
#### Random sampling
If you need a purely random sample, the sample size calculation takes three variables:
* Size of the full population: in a refugee context, the data comes from proGres, while in an IDP context, it comes from a displacement tracking system.
* Confidence level: how confident you want to be that the estimate falls within the error margin (usually 90%, 95% or 99%).
* Error margin (or confidence interval): how much error you are willing to tolerate for each question, i.e. plus or minus your estimated ratio (usually 5%, 2% or 1%).
There are [online calculators](https://www.surveymonkey.com/mp/sample-size-calculator/) for this. Alternatively, one can use the Excel formula from this [example](http://archive.snh.gov.uk/vmm/Resources/R38%20SAMPLE%20SIZE%20CALCULATOR.xls).
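The calculation can also be scripted directly. A minimal base R sketch, assuming the standard sample size formula for a proportion (p = 0.5, the most conservative value) with a finite population correction; results may differ from the tables below by a unit or two depending on the z-value and rounding used by a given calculator:
```{r sample_size_calc, eval=FALSE}
# Sample size for estimating a proportion, with finite population correction
sample_size <- function(N, conf_level = 0.95, error_margin = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - conf_level) / 2)      # z-score for the confidence level
  n0 <- z^2 * p * (1 - p) / error_margin^2   # infinite-population sample size
  ceiling(n0 / (1 + (n0 - 1) / N))           # adjust for the finite population
}

# 400,000 Syrians, 95% confidence, 5% error margin -> about 384
sample_size(N = 400000, conf_level = 0.95, error_margin = 0.05)

# 150,000 Afghans, 99% confidence, 1% error margin -> close to the table value of 14,937
sample_size(N = 150000, conf_level = 0.99, error_margin = 0.01)
```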
For 400,000 Syrians | 5% error margin | 2% error margin | 1% error margin
---------------------|--------------------|-----------------------| -----------------
90% Confidence level | 272 | 1694 | 6692
95% Confidence level | 384 | 2387 | 9379
99% Confidence level | 662 | 4105 | 15929
For 150,000 Afghans | 5% error margin | 2% error margin | 1% error margin
--------------------|---------------------|-----------------------|------------------
90% Confidence level| 272 | 1682 | 6511
95% Confidence level| 383 | 2363 | 9026
99% Confidence level| 661 | 4036 | 14937
Usually the decision on the right confidence level and error margin is also influenced by cost implications and by the intended use of the resulting figures.
#### Stratified sampling
You can refer to this [Introduction video](https://www.youtube.com/watch?v=WakK8Wzmw6o&list=PLyLpEs0x9BnmPTE2RRRJW058Nf7R_2xQa&index=5), this [presentation](http://ocw.jhsph.edu/courses/StatMethodsForSampleSurveys/PDFs/Lecture4.pdf) or this [one from the WFP VAM](https://resources.vam.wfp.org/sites/default/files/mVAM_Generic%20training%20for%20live%20call%20operators.pptx).
A stratified random sample can only be carried out if a complete list of the population is available. In stratified sampling the population is partitioned into groups, called strata, and sampling is performed separately within each stratum.
This can be done for the following reasons:
* Population groups may have different values for the responses of interest.
* If we want to improve our estimation for each group separately.
* To ensure adequate sample size for each group.
In stratified sampling designs, it is assumed that:
* stratum variables are mutually exclusive (non-overlapping), e.g. urban/rural areas, economic categories, geographic regions, race, sex, etc.,
* the population (elements) should be homogeneous within each stratum, and
* the population (elements) should be heterogeneous between the strata.
The major task of stratified sampling design is the appropriate allocation of samples to the different strata. The different types of allocation methods include (see the code sketch after this list):
* **Equal allocation**: Divide the number of sample units n equally among the k strata. This implies the use of “weighted analysis” (disproportionate selection).
* **Proportional to stratum size**: Make the proportion sampled in each stratum identical to that stratum's share of the population. A major disadvantage of proportional allocation is that the sample size in a stratum may be low and provide unreliable stratum-specific results. In terms of analysis, the data will be self-weighted (equal proportion from each stratum).
* Allocation based on **variance differences among the strata** (called optimal allocation): Optimal allocation minimizes the overall variance for a specified cost, or equivalently minimizes the overall cost for a specified variance. In situations where the standard deviations of the strata are known, it may be advantageous to make a disproportionate allocation. Suppose we had stratum A and stratum B, and we know that the individuals assigned to stratum A are more varied with respect to their opinions than those assigned to stratum B. Optimal allocation minimizes the standard error of the estimated mean by ensuring that more respondents are assigned to the stratum within which there is the greatest variation. Stratum variances are usually obtained from previous surveys. This approach also implies the use of “weighted analysis” (disproportionate selection).
* Allocation based on the **relative cost of each survey record** (called Neyman allocation): Neyman allocation is a special case of optimal allocation where the costs per unit are the same for all strata. In this case, the allocation maximizes precision for a stratified sample with a fixed sample size. More generally, the ideal allocation plan provides the most precision for the least cost: it samples more heavily from a stratum when the cost to sample an element from that stratum is low, the population size of the stratum is large, or the variability within the stratum is large. This approach also implies the use of “weighted analysis” (disproportionate selection).
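A minimal sketch of how these allocation rules translate into code, assuming the stratum sizes (N_h), stratum standard deviations (S_h) and relative per-unit costs (c_h) are known; all numeric values here are illustrative:
```{r allocation_rules, eval=FALSE}
n   <- 1000                                               # total sample size to distribute
N_h <- c(urban = 250000, rural = 120000, camp = 30000)    # stratum population sizes
S_h <- c(urban = 0.40,   rural = 0.55,   camp = 0.70)     # stratum standard deviations
c_h <- c(urban = 1,      rural = 3,      camp = 5)        # relative cost per interview

# Equal allocation: the same number of units in each stratum
n_equal <- setNames(rep(n / length(N_h), length(N_h)), names(N_h))

# Proportional allocation: n_h proportional to the stratum size N_h
n_prop <- n * N_h / sum(N_h)

# Neyman allocation: n_h proportional to N_h * S_h (equal costs assumed)
n_neyman <- n * (N_h * S_h) / sum(N_h * S_h)

# Optimal allocation: n_h proportional to N_h * S_h / sqrt(c_h)
n_optimal <- n * (N_h * S_h / sqrt(c_h)) / sum(N_h * S_h / sqrt(c_h))

round(rbind(n_equal, n_prop, n_neyman, n_optimal))
```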
Typically, when developing the strata definition, in the case of optimal or Neyman allocation, i.e. when strata variances are already known from a previous survey, the following objectives can be pursued:
* Find minimum sample size, given a fixed error
* Find minimum error, given a fixed sample size
* Find minimum error, given a fixed budget
* Find minimum cost to achieve a fixed error
Typical workflow to define sample size in case of stratified sampling:
1. Choose the stratification (e.g. regions, districts, etc.)
2. Define the population (N) of each stratum
3. Decide on key indicator(s)
4. Estimate the mean & variance or prevalence of the key indicator
5. Decide on precision and confidence level
6. Calculate the initial total sample size (n) according to the budget/time
7. Use a simple random sample within each stratum to select your representative sample
To estimate sample size, you need to know:
* Estimate of the prevalence or mean & STDev of the key indicator (e.g. 30% return intention). Prevalence is the total number of cases for a variable of interest that is **typically binary** within a population divided by its total population. Mean is the expected value of a variable of interest that is **typically continuous** within a prescribed range for a given population (e.g. expenditure per case)
* Precision desired (for example: ± 5%). Precision is the variability of the estimate.
* Level of confidence (for example: 95%). It represents the probability of the same result if you re-sampled, all other things equal.
* Population (only if below 10,000, otherwise it will not influence the required sample size)
* Expected response rate (for example: 90%)
* Number of eligible individuals per household (if applicable), as illustrated in the sketch just below this list.
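For instance, once the required number of individuals has been computed, it can be inflated for expected non-response and converted into a number of households to visit. A minimal sketch; the 90% response rate and the 1.8 eligible individuals per household are illustrative values:
```{r sample_size_adjust, eval=FALSE}
n_individuals   <- 384     # required sample size from the formula above
response_rate   <- 0.90    # expected response rate
eligible_per_hh <- 1.8     # average number of eligible individuals per household

# Number of households to visit so that enough eligible respondents actually answer
n_households <- ceiling(n_individuals / (response_rate * eligible_per_hh))
n_households
```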
Stratified sampling can be performed with R. [Tutorial scripts are available here](https://github.com/unhcr-mena/stratified-sampling).
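As a complement to those scripts, a minimal sketch of proportional allocation followed by simple random sampling within each stratum; the sampling frame `frame`, its `region` column and the sample size are hypothetical:
```{r stratified_srs, eval=FALSE}
set.seed(42)

# Hypothetical sampling frame: one row per household with its stratum
frame <- data.frame(
  id     = 1:10000,
  region = sample(c("North", "Centre", "South"), 10000,
                  replace = TRUE, prob = c(0.5, 0.3, 0.2))
)

n_total <- 400                                 # total sample size
N_h     <- table(frame$region)                 # stratum sizes
n_h     <- round(n_total * N_h / sum(N_h))     # proportional allocation

# Simple random sample without replacement within each stratum
sampled <- do.call(rbind, lapply(names(n_h), function(h) {
  stratum <- frame[frame$region == h, ]
  stratum[sample(nrow(stratum), n_h[[h]]), ]
}))

table(sampled$region)   # achieved sample per stratum
```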
#### Post stratification
One can also use weights, computed through a [post-stratification process](https://www.r-bloggers.com/survey-computing-your-own-post-stratification-weights-in-r/), to get potentially biased surveys, such as online surveys, to better fit the underlying population. The only thing that weights can do is ensure that your sample composition better mimics the general population’s characteristics. Weights will never help you if the process governing non-response is part of the puzzle you want to solve.
In a random sample, we define a population, draw from that population at random and then compute and apply weights to align the sample with the population. This weighting is necessary because some people originally sampled might be e.g. harder to reach than others, thereby biasing the sample. Once the post-stratification weights have been applied, the random sample is representative of the population it was drawn from. Statistics gives us a method to tell just how accurately the findings from the sample can be generalized.
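A minimal sketch of this weighting step with the `survey` package; the respondent data `resp`, the `sex` post-stratification variable and the population counts are hypothetical:
```{r post_stratification, eval=FALSE}
# install.packages("survey")
library(survey)

# Hypothetical respondents from a biased (e.g. online) survey: women over-represented
resp <- data.frame(
  sex      = c(rep("female", 300), rep("male", 100)),
  employed = rbinom(400, 1, 0.4),
  wt       = 1                     # start from equal weights
)

design <- svydesign(ids = ~1, data = resp, weights = ~wt)

# Known population counts for the post-stratification variable
pop_sex <- data.frame(sex = c("female", "male"), Freq = c(52000, 48000))

# Recompute the weights so the weighted sex distribution matches the population
design_ps <- postStratify(design, strata = ~sex, population = pop_sex)

svymean(~employed, design_ps)   # estimate using the post-stratified weights
```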
#### Cluster sampling
Cluster sampling is a technique that allows the surveying budget to be reduced when **travel costs are important**. Instead of covering a whole territory, cluster sampling divides the population into separate groups, called clusters. Then, a simple random sample of clusters is selected from the population.
Cluster sampling is therefore not relevant when techniques such as phone interviews are used, as there is no marginal surveying cost linked to the location of the interview.
Given equal sample sizes, cluster sampling usually provides less precision than either simple random sampling or stratified sampling.
Different approaches can be used for cluster sampling:
* One-stage sampling. All of the elements within selected clusters are included in the sample.
* Two-stage sampling. A subset of elements within each selected cluster is randomly selected for inclusion in the sample.
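A minimal sketch of both approaches in base R, assuming a hypothetical frame of villages (clusters) and households:
```{r cluster_sampling, eval=FALSE}
set.seed(7)

# Hypothetical frame: 5,000 households spread over 100 villages (clusters)
frame <- data.frame(
  household_id = 1:5000,
  village      = sample(paste0("village_", 1:100), 5000, replace = TRUE)
)

# Stage 1: simple random sample of clusters
selected_villages <- sample(unique(frame$village), 10)

# One-stage: keep every household in the selected clusters
one_stage <- frame[frame$village %in% selected_villages, ]

# Two-stage: randomly select a fixed number of households within each selected cluster
two_stage <- do.call(rbind, lapply(selected_villages, function(v) {
  hh <- frame[frame$village == v, ]
  hh[sample(nrow(hh), min(15, nrow(hh))), ]
}))

nrow(one_stage); nrow(two_stage)
```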
#### Sampling With Replacement and Sampling Without Replacement
When a population element can be selected more than one time, we are sampling with replacement. When a population element can be selected only one time, we are sampling without replacement. When we sample with replacement, the two sample values are independent. Practically, this means that what we get on the first draw doesn't affect what we get on the second. Mathematically, this means that the covariance between the two is zero. In sampling without replacement, the two sample values aren't independent. Practically, this means that what we got on the first draw affects what we can get on the second. Mathematically, this means that the covariance between the two isn't zero.
In small populations and often in large ones, sampling is typically done "without replacement", i.e. one deliberately avoids choosing any member of the population more than once. Although sampling can be conducted with replacement instead, this is less common, and for a small sample from a large population, sampling without replacement is approximately the same as sampling with replacement, since the odds of choosing the same individual twice are low. You can tell how large the difference is by calculating the covariance. That is a measure of how much the two draws' probabilities are linked together; the larger the covariance in absolute value, the larger the difference. A covariance of zero would mean there is no difference between sampling with replacement and sampling without.
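The difference can be checked by simulation. A small sketch drawing two values from a tiny population and comparing the covariance of the first and second draw with and without replacement:
```{r replacement_covariance, eval=FALSE}
set.seed(1)
population <- c(1, 2, 3, 4, 5)

draw_pair_cov <- function(replace) {
  # Each column holds one pair of draws (first draw, second draw)
  draws <- replicate(10000, sample(population, 2, replace = replace))
  cov(draws[1, ], draws[2, ])   # covariance between the first and second draw
}

draw_pair_cov(replace = TRUE)    # close to 0: the draws are independent
draw_pair_cov(replace = FALSE)   # negative: the draws are not independent
```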
As explained in this [paper](http://www.statcan.gc.ca/pub/12-001-x/2001002/article/6089-eng.pdf), bias may be introduced into population estimates in telephone surveys by the exclusion of non-telephone households. The bias introduced can be significant, since non-telephone households may differ from telephone households in ways that are not adequately handled by post-stratification. Many households, called “transients”, move in and out of the telephone population during the year, sometimes due to economic reasons or relocation. The transient telephone population may be representative of the non-telephone population in general, since its members have recently been in the non-telephone population.
## Sample Weight
Over-sampling in regions with small populations ensures that they have a large enough sample to be representative. Under-sampling is done in regions with large populations to save costs. Sample weights are mathematical adjustments applied to the data to correct for over-sampling, under-sampling, and different response rates to the survey in different regions.
### How are the oversampled/undersampled areas corrected in data analysis?
The samples are designed to permit data analysis of regional subsets within the sample population. When the expected number of cases for some of these regions is too small for analysis, it is necessary to oversample those areas. When the expected number of cases for some of these regions is unnecessarily large, those areas may be undersampled to accommodate logistical or budgetary constraints.
During analysis, it is then necessary to "weight down" the oversampled areas and "weight up" the undersampled areas. The development of the sampling weights takes this factor into account. Always use the weight variable found in the DHS data set. Even in surveys that come from a self-weighting sample, it is still necessary to use the sampling weights in analysis, because response behavior may differ across groups.
### What does it mean to normalize the weights?
After the weights are initially calculated, they are normalized, or standardized, by dividing each weight by the average of the initial weights (equal to the sum of the initial weight divided by the sum of the number of cases) so that the sum of the normalized/standardized weights equals the sum of the cases over the entire sample. The standardization is done separately for each weight for the entire sample.
The entire set of household sample weights is multiplied by a constant so that the total weighted number of households equals the total unweighted number of households at the national level.
Individual sample weights are normalized separately for women and men. Thus, the total weighted number of women equals the total unweighted number of women, and the total weighted number of men equals the total unweighted number of men. Women and men are normalized separately because all non-HIV calculations are performed on women and men separately. We do not provide survey estimates on the joint population of women and men combined for anything other than HIV prevalence.
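A minimal sketch of this normalization, assuming a hypothetical set of design weights for women and men:
```{r weight_normalization, eval=FALSE}
# Hypothetical design weights (e.g. inverse selection probabilities)
weights <- data.frame(
  sex    = c(rep("female", 6), rep("male", 4)),
  design = c(1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.4, 1.0, 0.6)
)

# Normalize separately for women and men: divide by the group mean so that
# the normalized weights sum to the unweighted number of cases in each group
weights$normalized <- ave(weights$design, weights$sex,
                          FUN = function(w) w / mean(w))

tapply(weights$normalized, weights$sex, sum)   # equals the group sizes (6 and 4)
```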