GATHERING DATA

ST101 – DR. ARIC LABARR


GATHERING DATA

Data is everywhere.

With all this data being gathered and stored, we need to understand good practices of gathering data.

Data gathered without thinking ahead of time leaves itself open for problems later.

*IDC Digital Universe 


Main concepts in this section of the course:

Samples and populations

Randomness

Good vs. bad sampling methods

Ethical concerns around data



GATHERING DATA



Why are we collecting data?

To make better decisions about a group of people, places, things, etc.


Why would data help with that?

If data represents the things we are interested in, it can provide insights.

If data doesn’t represent the things we are interested in, it can provide misleading results and lead to incorrect decisions.

This is NOT trivial! It is foundational!


WHY DO WE CARE?


Imagine you wanted to know the average height of the adult population in the United States because you are designing a new clothing line for adults.

You take a sample of people (subset of people) since you think it will be impossible to ask everyone in the United States what their height is.

Your sample consists entirely of professional basketball players.

Do you see any problem here?

EXAMPLE – HEIGHT


Do you see any problem here?

Professional basketball players are probably taller than most adults in the United States.

If they are taller, then our estimate of the average height will be too high!

The clothes will not fit typical adults, which will lead to poor sales and resources wasted on producing clothes that few people buy.


EXAMPLE – HEIGHT
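
To see the size of the problem in numbers, here is a minimal simulation sketch. All figures are invented for illustration: adult heights are assumed to be roughly normal with a mean of 170 cm and a standard deviation of 10 cm, and the "basketball players" are simply taken to be the 500 tallest people in the simulated population.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 adult heights in cm (assumed values, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Biased sample: only the 500 tallest people (a stand-in for basketball players).
tallest = sorted(population)[-500:]
biased_mean = statistics.mean(tallest)

# Random sample of the same size drawn from the whole population.
random_sample = random.sample(population, 500)
random_mean = statistics.mean(random_sample)

print(f"Population mean:     {true_mean:.1f} cm")
print(f"Tallest-only sample: {biased_mean:.1f} cm")
print(f"Random sample:       {random_mean:.1f} cm")
```

The tallest-only sample overshoots the population mean by a wide margin, while the random sample of the same size lands close to it.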


The data we made decisions from did not represent the people we wanted to serve!

The data wasn’t bad, just collected in a way that didn’t provide the insights we wanted.

How do we ensure we don’t make this mistake?

Samples and populations

Randomness

Good vs. bad sampling methods

EXAMPLE – HEIGHT


Data gathered without thinking ahead of time leaves itself open for problems later.

If data represents the things we are interested in, it can provide insights.

If data doesn’t represent the things we are interested in, it can provide misleading results and lead to incorrect decisions.


SUMMARY


SAMPLES AND POPULATIONS

GATHERING DATA


Before we start gathering data, it is good for us to know who or what we are interested in gathering information about.

We should also consider what we want to know about the group we are interested in.

GATHERING DATA


Population – set of all objects/individuals of interest.

Usually too large to obtain information from entire population.

Example:

Want to know average height of adults in the United States.

Impossible to actually get information from all adults in United States.


Obtaining information from the whole population is called a census.


POPULATION


Population – set of all objects/individuals of interest.

Example:

Want to know average height of adults in the United States.


Must pay attention to details of the population.

What do you consider an adult?

If this is for marketing a new clothing line, do you want ALL adults? Adults of certain age range? Business or casual? Certain region of the country?

Many problems with sampling come from not fully defining the population.

POPULATION DETAILS


Population – set of all objects/individuals of interest.

Example:

Want to know average height of adults in the United States.


Parameter – measures computed from a population.


POPULATION PARAMETER


Sample – subset of the population from which information is actually obtained.

Should represent the population well.


Sampling frame – actual list from which the sample is taken.

May not equal the population.

SAMPLE


Sample – subset of the population from which information is actually obtained.

Should represent the population well.


Statistic – measures computed from a sample.

A sample statistic is the point estimate of the population parameter.

A point estimate is a single-number estimate of an unknown parameter.

SAMPLE STATISTIC


PUT IT ALL TOGETHER

Population

Sample

Statistic

Parameter

Population – set of all objects/individuals of interest.

Sample – subset of the population from which information is actually obtained.

Statistic – measures computed from a sample.

Parameter – measures computed from a population.





EXAMPLE

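A retail chain is trying to determine if a new product they introduced is selling well across their stores. The retail chain has 2135 stores nationwide. The analyst in charge of this project is tasked with estimating the average daily sales of this new product across all stores. Older computing technology forces the company to randomly pick 179 stores spread evenly throughout the nation to gather data from. The average daily sales from these 179 stores is $129.19.


Identify population, sample, parameter, statistic.

Population – all 2135 stores in the chain.

Sample – the 179 randomly selected stores.

Parameter – the average daily sales of the new product across all 2135 stores (unknown).

Statistic – the average daily sales across the 179 sampled stores, $129.19.

Any sampling frame issues? No. The 179 stores are picked from the chain’s own stores, so the sampling frame matches the population.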
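
The distinction between parameter and statistic can be sketched in code. The store counts (2135 and 179) come from the example; the per-store sales figures are invented, so the printed values will not match the $129.19 on the slide.

```python
import random
import statistics

random.seed(7)

N_STORES = 2135   # population size from the example
N_SAMPLE = 179    # sample size from the example

# Hypothetical average daily sales per store, in dollars (invented for illustration).
population_sales = [random.uniform(50, 200) for _ in range(N_STORES)]

# Parameter: computed from every store. In practice this is the unknown we want.
parameter = statistics.mean(population_sales)

# Sample: 179 stores chosen at random. Statistic: the sample mean, our point estimate.
sample_sales = random.sample(population_sales, N_SAMPLE)
statistic = statistics.mean(sample_sales)

print(f"Parameter (all {N_STORES} stores): ${parameter:.2f}")
print(f"Statistic ({N_SAMPLE} sampled stores): ${statistic:.2f}")
```

In a real study only the statistic is available; the parameter is computed here only because the population is simulated.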


Population – set of all objects/individuals of interest.

Sample – subset of the population from which information is actually obtained.

Sampling frame – actual list from which the sample is taken.

Statistic – measures computed from a sample.

Parameter – measures computed from a population.


SUMMARY


RANDOMNESS

GATHERING DATA




What comes to mind when you think of randomness?

Not knowing what is going to happen…

Fairness (equal chance for outcomes)…

RANDOMNESS



Random – an outcome is random if we know the particular outcomes that something could have but are unsure of which of those outcomes is about to happen.


Not knowing what is going to happen…

Kind of true. We know what could happen, but not which of the outcomes will happen.

Flip a fair coin → could be heads or tails, but not sure which.

Fairness (equal chance of outcomes)…

Could be true, but not required. 

An unfair coin is still random, but the outcomes are not equally likely.

RANDOMNESS
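
A short simulation sketch of the two coins. The 0.8 probability of heads for the unfair coin is an arbitrary value chosen for illustration, and `flip` is a helper defined here, not a library function.

```python
import random
from collections import Counter

random.seed(3)

def flip(p_heads: float) -> str:
    """Return 'H' with probability p_heads, otherwise 'T'."""
    return "H" if random.random() < p_heads else "T"

# Both coins are random: we know the possible outcomes but not which one comes next.
fair = Counter(flip(0.5) for _ in range(10_000))
unfair = Counter(flip(0.8) for _ in range(10_000))  # 0.8 is an assumed bias

print("Fair coin:  ", dict(fair))
print("Unfair coin:", dict(unfair))
```

Both coins produce heads and tails in an unpredictable order; only the long-run proportions differ.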


RANDOMNESS AND SAMPLING

Population

Sample

Statistic

Parameter

Having randomness helps make the sample representative of the population.


Protects us from having certain pieces of information overly influence our sample.


RANDOMNESS AND SAMPLING

Population

Sample

Statistic

Parameter

Having a good representative sample means the inference we make from the statistic to the parameter is reasonable!


Random – an outcome is random if we know the particular outcomes that something could have but are unsure of which of those outcomes is about to happen.

Having randomness helps make the sample representative of the population.

Having a good representative sample means the inference we make from the statistic to the parameter is reasonable.


SUMMARY


BAD SAMPLING METHODS

GATHERING DATA


PARAMETERS VS. STATISTICS

Population

Sample

Statistic

Parameter

Need good sampling to have good estimates.


SAMPLING

There are many different ways to sample data from a population.

Mistakes in sampling can lead to bias in the sample.


Bias – certain outcomes are favored over other outcomes in samples.


TYPES OF BIAS

Bias – certain outcomes are favored over other outcomes in samples.

2 Common Types of Bias:

Selection Bias

Undercoverage

Nonresponse

Sampling Bias


Undercoverage – sampling frame and population are not equal.


Problem:

Sample doesn’t represent the population of interest.

Incorrect and biased inference is made.


Example – Phone book: people without a listed number can never be selected.


UNDERCOVERAGE


Nonresponse – subject in sample cannot / will not respond or be measured.


Problem:

Those who respond don’t represent the population as a whole.

Incorrect and biased inference is made.


Example – Telemarketers: many of the people called hang up or never answer.


NONRESPONSE


TYPES OF BIAS

Bias – certain outcomes are favored over other outcomes in samples.

2 Common Types of Bias:

Selection Bias

Sampling Bias

Convenience sampling

Voluntary sampling


Convenience sampling – technique that selects subjects from population based on accessibility and ease.


Problem:

Just because subjects are easy to reach doesn’t mean they represent the population of interest as a whole.

Incorrect and biased inference is made.


Example – Surveyors who approach shoppers in a store.

CONVENIENCE SAMPLING


Voluntary sampling – technique where subjects volunteer themselves for the sample.


Problem:

People who volunteer don’t necessarily represent the population of interest as a whole.

Incorrect and biased inference is made.


Example – Marriage questionnaire.

VOLUNTARY SAMPLING


Need good sampling to have good estimates.

Bias – certain outcomes are favored over other outcomes in samples.

2 Common Types of Bias:

Selection Bias

Undercoverage

Nonresponse

Sampling Bias

Convenience sampling

Voluntary sampling


SUMMARY


GOOD SAMPLING METHODS

GATHERING DATA



STATISTICAL TECHNIQUES

Statistical sampling techniques use selection methods based on chance selection instead of convenience or judgement.

4 Common Techniques:

Simple Random Sampling (SRS)

Stratified Random Sampling

Cluster Sampling

Systematic Sampling


SIMPLE RANDOM SAMPLING (SRS)

A method of sampling items from a population such that every possible sample of a specified size has an equal chance of being selected.


Advantages:

No statistical bias; no additional information about the population is needed ahead of time.

Disadvantages:

Expensive, time consuming, hard to implement, need list of population.


STRATIFIED RANDOM SAMPLING (STS)

A method of sampling items where the population is divided beforehand into subgroups, called strata, so that each member of the population belongs to exactly one stratum. Sample items from every stratum (with SRS, for example).


Advantages:

Smaller sample sizes can achieve same accuracy as SRS, more information about parts of population.

Disadvantages:

Need information about population ahead of time to split on!


A method of sampling items where the population is divided beforehand into subgroups, called clusters, so that each member in the population belongs to only one cluster. Sample items from a sample of m clusters selected randomly.


Advantages:

Overcome issues with travel, time, and expense; Easier to implement than SRS or STS.

Disadvantages:

Need information about population ahead of time to split on – but not total list!

May have slight bias if random clusters aren’t representative.

CLUSTER SAMPLING


A method of sampling items that involves selecting every kth item in the population after randomly selecting a starting point between 1 and k.

The value k is the ratio of the population size to the desired sample size.


Advantages:

Very easy to get sample.

Disadvantages:

May be biased, especially if order of list of population matters.

SYSTEMATIC SAMPLING


A large worldwide financial company wants to develop a new retirement plan for the company. They want to survey different managers of branches around the world to find out the most important strategies the new retirement plan should contain. They have 5000 branches worldwide and want to personally interview these branch managers. They have information about the branch size (small, medium, large), and the state/province location of the branch. They want to talk to 50 branch managers.


Develop four separate strategies to sample these branch managers based on the four different statistical sampling techniques discussed previously.

EXAMPLE


EXAMPLE – SIMPLE RANDOM SAMPLE

Randomly sample 50 branches to interview their managers.

Need a list of branches to randomly sample from.

Branch List: 1, 2, 3, …, 4998, 4999, 5000

50 Branches: 434, 938, 2582, …, 3218, 3439, 4134 
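
A minimal sketch of this draw, assuming the branch list is just the IDs 1 through 5000. `random.sample` draws without replacement, so every possible set of 50 branches is equally likely; the IDs it returns will differ from the ones shown on the slide.

```python
import random

random.seed(42)

branches = list(range(1, 5001))            # branch IDs 1..5000 stand in for the branch list
srs = sorted(random.sample(branches, 50))  # simple random sample without replacement

print(srs[:5], "...", srs[-3:])
```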



EXAMPLE – STRATIFIED RANDOM SAMPLE
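
One way to carry this out is to stratify by branch size (small, medium, large) and take an SRS within each stratum. The sketch below rests on two assumptions that the example does not state: the size assigned to each branch is invented, and the 50 interviews are split across strata in proportion to stratum size.

```python
import random
from collections import defaultdict

random.seed(42)

# Hypothetical branch list: the ID-to-size mapping is invented for illustration.
sizes = random.choices(["small", "medium", "large"], weights=[5, 3, 2], k=5000)
branches = list(zip(range(1, 5001), sizes))

# Group branch IDs into strata; each branch belongs to exactly one stratum.
strata = defaultdict(list)
for branch_id, size in branches:
    strata[size].append(branch_id)

TOTAL_SAMPLE = 50
sample = {}
for size, members in strata.items():
    # Proportional allocation (an assumption): sample size tracks the stratum's share.
    # Rounding can make the overall total differ from 50 by one.
    n = round(TOTAL_SAMPLE * len(members) / len(branches))
    sample[size] = sorted(random.sample(members, n))  # SRS within the stratum

for size, ids in sample.items():
    print(f"{size:>6}: {len(ids)} of {len(strata[size])} branches sampled")
```

Every stratum contributes to the sample, which is what gives more information about each part of the population.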

 


EXAMPLE – CLUSTER SAMPLING

Split branches up by state.

Randomly sample (SRS) 5 states.

Randomly select (SRS) 10 branches in each state. 


Potential bias – what if these 5 states don’t represent the population of all states well?
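
A sketch of the two stages, assuming a made-up layout of 50 regions with exactly 100 branches each. The region names and the branch-to-region mapping are hypothetical.

```python
import random
from collections import defaultdict

random.seed(42)

# Hypothetical setup: 50 regions with 100 branches each (invented for illustration).
clusters = defaultdict(list)
for branch_id in range(1, 5001):
    region = f"region_{branch_id % 50:02d}"
    clusters[region].append(branch_id)

# Stage 1: SRS of 5 clusters (states/provinces).
chosen_regions = random.sample(sorted(clusters), 5)

# Stage 2: SRS of 10 branches within each chosen cluster.
sample = {
    region: sorted(random.sample(clusters[region], 10))
    for region in chosen_regions
}

for region, ids in sample.items():
    print(region, ids)
```

Only the 5 chosen regions need a full branch list, which is why this approach saves travel, time, and expense; the other 45 regions have no chance of contributing once stage 1 is done.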


EXAMPLE – SYSTEMATIC SAMPLING

Split list of branches into groups of 5000 / 50 = 100.

Randomly select (SRS) starting point in first group of 100.

First Group List: 1, 2, 3, …, 98, 99, 100

Starting point: 9

Take same point in each group.

Branch List: 1, 2, 3, …, 4998, 4999, 5000

50 Branches: 9, 109, 209, …, 4709, 4809, 4909
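
The same procedure as a short sketch, again assuming the branch list is the IDs 1 through 5000. The starting point is drawn at random here, so it will generally not be 9.

```python
import random

random.seed(42)

N = 5000     # population size (branches)
n = 50       # desired sample size
k = N // n   # sampling interval: 5000 / 50 = 100

start = random.randint(1, k)               # random starting point between 1 and k, inclusive
systematic = list(range(start, N + 1, k))  # start, start + k, start + 2k, ...

print(f"k = {k}, start = {start}")
print(systematic[:3], "...", systematic[-3:])
```

With a starting point of 9 this reproduces the list above: 9, 109, 209, …, 4909.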


Develop four separate strategies to sample these branch managers based on the four different statistical sampling techniques discussed previously.

SRS – Randomly sample 50 branches to interview their managers.

STS – Stratify by size and select SRS from each.

Cluster – Randomly select sample of states/provinces, then select branches at random from those states/provinces.

Systematic – Select every 100th branch in list of branches, starting from a randomly chosen point among the first 100.

EXAMPLE – PUT ALL TOGETHER


Need good sampling to have good estimates.

4 Common Techniques:

Simple Random Sampling (SRS)

Stratified Random Sampling (STS)

Cluster Sampling

Systematic Sampling


SUMMARY


EXPERIMENTS

GATHERING DATA


Data collection studies are usually classified as observational or experimental.

Observational – researcher does not interfere or intervene in the process of collecting data.

Requires selecting a sample. 

Experimental – researcher manipulates the conditions in which the study is carried out.

Requires selecting a sample and designing and conducting an experiment.


TYPES OF STUDIES


Imagine you wanted to know the average height of the adult population in the United States because you are designing a new clothing line for adults.


Observational study – just observing what has happened (height) in our population of interest.

OBSERVATIONAL EXAMPLE


In an experiment, the researcher randomly assigns treatments to experimental units.

Factor – predictor variable that takes on a finite number of values (categorical variable).

Level – setting a factor can take on.

Treatment – specific experimental condition, either the level of a factor (if only 1 factor) or the combinations of the levels from several factors. 


EXPERIMENT TERMINOLOGY


A mechanical engineer wanted to determine which variables influence gas mileage of a certain year and model of a car. 

Gas mileage is the variable we are interested in.

Factors studied: 

Tire pressure (low, standard).

Octane rating of fuel (regular, midgrade, premium).

Held constant the following variables:

Weather conditions.

Route.

Tire type.



EXPERIMENT EXAMPLE
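
With two factors, each treatment is one combination of levels. A small sketch that lists them; the dictionary keys `tire_pressure` and `octane` are names chosen here for illustration.

```python
from itertools import product

# Factors and levels from the gas mileage example.
factors = {
    "tire_pressure": ["low", "standard"],
    "octane": ["regular", "midgrade", "premium"],
}

# A treatment is one combination of levels, one level per factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for i, treatment in enumerate(treatments, start=1):
    print(i, treatment)
print(f"{len(treatments)} treatments in total")
```

Two levels of tire pressure times three octane ratings gives 2 × 3 = 6 treatments.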


The key thing that makes this study an experimental study is the active role the researcher plays in manipulating the environment.

This makes it difficult in some situations to have a true experiment.

Effects of smoking on children?

Effects of childhood family income on college performance?


EXPERIMENTS


Three key components to a well-designed experiment

1. Randomization – treatments are randomly assigned to experimental units   

2. Replication – multiple subjects are assigned the same treatment  

Subjects with the same treatment are called replicates.

More replication allows us to have more confidence in our study conclusions.


3. Control – some study conditions are held constant in order to reduce variability.

This means controlling certain variables (sometimes called nuisance variables) that can impact what we are interested in.

This makes it easier to see differences due to our treatments.


DESIGN OF EXPERIMENTS
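
Randomization and replication can be sketched for the gas mileage experiment. The car labels and the choice of 4 replicates per treatment are assumptions made for illustration; control (same weather, route, and tire type) happens in how the runs are carried out and is not modeled here.

```python
import random
from collections import Counter
from itertools import product

random.seed(42)

# The 6 treatments: every combination of tire pressure and octane rating.
treatments = list(product(["low", "standard"], ["regular", "midgrade", "premium"]))
REPLICATES = 4  # assumed number of cars per treatment

# Hypothetical experimental units: 24 cars of the same year and model.
cars = [f"car_{i:02d}" for i in range(1, len(treatments) * REPLICATES + 1)]

# Randomization: shuffle the cars, then deal them out to treatments in equal blocks.
random.shuffle(cars)
assignment = {car: treatments[i // REPLICATES] for i, car in enumerate(cars)}

# Replication check: every treatment should have been given to REPLICATES cars.
print(Counter(assignment.values()))
```

Shuffling before assigning means no one chooses which car gets which treatment, and the count confirms each treatment has 4 replicates.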


Observational study – researcher does not interfere or intervene in the process of collecting data.

Experimental study – researcher manipulates the conditions in which the study is carried out.

Three key components to a well-designed experiment

Randomization

Replication

Control

SUMMARY


DATA ETHICS

GATHERING DATA


The gathering of data leads to questions around the ethical collection and use of that data.

As Christians we are held to an even higher standard around ethical considerations.

GATHERING DATA


In observational studies / experiments we must keep the interest of the subject we are collecting data from at the forefront. 


1964 Helsinki Declaration of the World Medical Association:

“The interests of the subject must always prevail over the interests of science and society.”



AREAS OF ETHICAL CONCERNS


Collection of data:

Institutional review boards

Informed consent

Confidentiality


SAFEGUARDS


Someone must have the best interests of the subjects of the data collection in mind.

Medical studies require institutional review boards to evaluate every study before it is conducted so that subjects are not put in harm’s way.

These are not required for a lot of business studies, but the people collecting the data SHOULD take the subject into account before any data collection is performed.


INSTITUTIONAL REVIEW BOARDS


Informed – subjects should be told what data is needed from them and what could happen as a result of giving that data to the people collecting it.

Must ensure that ALL information is shared.

May be hard for those gathering the data since they believe in their work and its usefulness. 

Must always consider the risks.

INFORMED CONSENT


Consent – after being informed, subjects must agree to the collection of data (usually in writing).

Who can give consent?

What about children? Mentally ill subjects? 

Some are afraid that consent is harder to come by if you reveal ALL possible bad outcomes, no matter how unlikely. Is this bad? 

INFORMED CONSENT


Once data is collected, privacy is VERY IMPORTANT!

Confidentiality – the subjects in the data have their identifying information masked. 

You can report overall statistics about data that is gathered, but not who belonged to a certain outcome (unless you are reporting results to others who own the data).


There are many stories of confidential data being leaked due to computer hacking.

CONFIDENTIALITY


Anonymity – identifying information about the subjects is NEVER known in the data collection.


Anonymity is more private than confidentiality!

ANONYMITY VS. CONFIDENTIALITY


You want to know which website design will work better to get people to click on your products.  You randomly show one of the two websites to people who visit your website to measure which design performs better.

Any concerns around…

Institutional review?

Informed consent?

Confidentiality?

WEBSITE TESTING EXAMPLE


You wear a watch that tracks your heart rate and sends that information off to the company. That company uses the information to determine trends and characteristics of people at risk for heart disease.

Any concerns around…

Institutional review?

Informed consent?

Confidentiality?

WEARABLE MEDICAL DEVICE EXAMPLE


The gathering of data leads to questions around the ethical collection and use of that data.

As Christians we are held to an even higher standard around ethical considerations.

In observational studies / experiments we must keep the interest of the subject we are collecting data from at the forefront. 

Collection of data:

Institutional review boards

Informed consent

Confidentiality


SUMMARY


COLLECTING DATA INTUITION

GATHERING DATA


Main concepts in this section of the course:

Samples and populations

Randomness

Good vs. bad sampling methods

Ethical concerns around data



GATHERING DATA


Who are you REALLY interested in gathering data around?

The biggest problem with setting a population is not providing enough detail.

Be very detailed and it will save you time later on.

INTUITION – POPULATION OF INTEREST


Does your sample represent your population?

Good sampling methods that involve randomness help you get a sample that represents the population. 

Still good practice to explore your data to make sure it looks like the population in a commonsense way.

Example:

It is possible to get REALLY unlucky and randomly select only NBA players for your height study.

However, upon investigation, you realize that your sample probably isn’t right, so you take another sample.

INTUITION – REPRESENTATIVE SAMPLE


Does your sampling favor certain outcomes over others?

It’s always good to think about your sampling method to make sure you haven’t built in any bias.

Make sure your sampling method has randomness to help protect you against bias.

INTUITION – GOOD SAMPLING


Can anyone be harmed or burdened by the collection and use of your data?

Think about the possible harm the collection of your data could have.

You must be open and honest with people you are collecting data on.

Remember, God holds us to a higher standard than the world, let’s represent Him well!

INTUITION – ETHICAL CONSIDERATIONS


It is EXTREMELY hard to protect yourself and consider all these things by yourself.

ASK FOR HELP!

I always like to ask others who I know (especially if they have different perspectives and experiences than I do) to make sure I am not missing anything.

INTUITION – OVERALL 


Intuition and careful thought can often protect you when it comes to data gathering.

Use other people to help make sure you are considering all the things you need to.

SUMMARY


Last modified: Monday, October 17, 2022, 12:53 PM