Statistics: sampling from a database

I have a patient registry: a database in which patients with a specific cancer volunteer to share their medical history.
I am doing statistics based on this database.
Is this a biased sample? Why?
How can I minimize the bias of inferring from this database rather than from a regular population sample?

please advise

QCD Commented:
Okay, let's say a few words in general about bias.

Bias is when your data gets skewed improperly because your sampling is not random.

But just because you are not sampling randomly, it doesn't necessarily follow that you will get bias.


You think that your sample is biased towards people who are tech-savvy.  This might be true, but unless cancer rates also favour people who are tech-savvy, this should not affect your results.

You think that people who are younger are more likely to respond.  This would have an impact on any type of cancer that prefers people of a certain age.

Fighting bias is a bit of an art.  First of all, it strongly depends on exactly what pieces of information you're trying to gather, since different factors will skew different pieces of information.  (Basically, you have to do this analysis separately for each conclusion you want to build.)  Second, you have to brainstorm all of the factors that distinguish your sample from a random sample of the general population, and then you have to decide whether this factor affects the question you are trying to answer.

Now, you have to decide whether you are going to try to compensate for it, or just acknowledge that your conclusions might be a bit skewed, and briefly explain how.  Compensating for bias is hairy business, so unless you think that your conclusions are absolutely worthless without it, I'd stay away from it.  But this is what it might look like:

Suppose you really have an age bias - most of your respondents are young.  Well, you can categorize your responses into age groups, then weight them so that their final influence on your data matches the distribution of ages in the general population.  This can be dangerous, because statistical errors increase wildly for small samples - if only two or three people over the age of 60 responded, their data will have an undue influence on your final conclusions.  This sort of procedure really only works when you have large enough samples to mitigate these issues.
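
A minimal sketch of the weighting idea above, using made-up numbers. The age groups, the outcome values, and the reference shares in `population_shares` are all assumptions for illustration; in practice the reference distribution would have to come from an outside source describing all patients with this cancer. The helper names `poststratification_weights` and `weighted_mean` are hypothetical.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Give each respondent a weight so that each group's total weight
    matches that group's share of the reference population."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        weights.append(population_shares[group] / sample_share)
    return weights

def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical registry: 8 respondents under 60 and only 2 aged 60 or over.
age_group = ["<60"] * 8 + ["60+"] * 2
# Hypothetical outcome per respondent (say, months to progression).
outcome = [20, 22, 18, 24, 21, 19, 23, 21, 10, 12]
# Assumed age distribution in the reference population (not from the thread).
population_shares = {"<60": 0.4, "60+": 0.6}

weights = poststratification_weights(age_group, population_shares)
raw_mean = sum(outcome) / len(outcome)            # unweighted registry mean
adjusted_mean = weighted_mean(outcome, weights)   # reweighted to the reference ages

print(raw_mean, adjusted_mean)
```

In this toy data each of the two older respondents ends up with six times the weight of a younger one, which is exactly the small-sample danger described above: two people drive most of the adjusted figure.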

Anyway, my recommendation would be to stay away from this business unless you have a statistician around to do the analysis.
David L. Hansen (Programmer Analyst) Commented:
It would only be a biased sample if the volunteers knew what kind of statistical analysis you were going to do and then were able to create/alter/update their medical history to "match" the analysis.
titorober23 (Author) Commented:
I think I disagree. Because this is a volunteer patient registry, the people who volunteer are most likely somehow more tech savvy than the others, so the database record set is not a randomly chosen sample of the population.

please comment
David L. Hansen (Programmer Analyst) Commented:
Your question has to do with bias. There are a number of things that can sour a statistical study; bias is one of them. Your worry about the individuals being tech savvy will isolate a group of people, true; however, that just gives you a smaller population. It would be a problem if you were conducting a social or age-dependent study. You are not; the cancer will not have any tendency to afflict tech-savvy individuals more or less than any other group of people (unless you are looking at cancer of the eye or something like that). Again, bias will only be a factor if the individuals volunteering their information can alter that information based on their knowledge of what you are investigating.
Bias can be a factor if individuals with certain information are more likely to volunteer than individuals with other information.
David L. Hansen (Programmer Analyst) Commented:
Yes, ozo has it. If somehow tech-savvy individuals with cancer have certain medical information that is consistently and persistently different from that of all other people with cancer, then you have a problem. Can you think of why that might be? I cannot.
Certain types of medical information could be correlated with education, which could be correlated with tech-savvy.
Certain types of tech may be associated with certain kinds of cancer.
Certain types of medical information could be embarrassing or personal and less likely to be volunteered.
Certain types of people may be more likely to have their cancer diagnosed than others.
There may be many ways that we have not thought of that the sample may differ from a regular population.
Certain types of statistics may be more or less sensitive to these effects than other types of statistics.
titorober23 (Author) Commented:
Another issue could be that my database has more young people. Could this be a bias, since as we know young people tend to use the computer more than older people?
please comment
If you are doing statistics that depend on age, that could be another bias.
> it strongly depends on exactly what pieces of information you're trying to gather,
So what statistics are you doing based on this database?
What questions are you interested in inferring from this database?
Any sample drawn from this database will likely be biased because not all cancers occur with the same regular frequency; those with more common cancers will be more likely to be chosen even if you randomly sample the database.

In addition, a sample can still be biased if you isolate one type of cancer and draw only from that subset, because not all patients with the same type of cancer are prescribed the same treatment; the sample may be biased by type of treatment.

Bottom line: What's your hypothesis? Without deriving hypotheses it is not possible to determine where the threats of bias will derive from; thus you do not know how to select a sample so that you distribute any extraneous variability across your sample as equally as possible.
titorober23 (Author) Commented:
My main goal is to infer about survival and how well patients respond to a drug, but those are questions that do not depend on age or on whether patients are tech savvy.
I want to know how long before they progress; all these questions depend on the diagnosis and type of cancer.

please comment
titorober23 (Author) Commented:
Records in this database are from patients with one type of cancer; in other words, this is a database of patients with the same cancer.
jazjef Commented:
You could use a statistical model such as Logistic Regression or Discriminant Analysis...

Logistic Regression typically uses a binary classification outcome variable, such as "HI back injury risk" vs. "LOW back injury risk". Discriminant Analysis allows you to use 3, 4, 5, or more possible classification groups, such as Very LO, LO, MED, HI, Very HI.

For example, using Discriminant Analysis...

Classify each patient as a HI, MED, or LO (etc.) "responder" based on your criteria for a 'responder', using all variables that are relevant.

Use Discriminant Analysis to create a statistical model and put all variables of interest---including tech-savvy and age etc into the analysis [you will need a program such as SPSS or SAS]

Discriminant Analysis will assess how accurately cases can be assigned to the groups you have defined (HI, MED, LO), which is called 'classification', and will generate a mathematical equation that has a "weighting" for each variable.  Like this: Group = a + b1*x1 + b2*x2 + ... + bm*xm

Use the mathematical equation variable "weightings" to determine which variables are contributing the greatest amount of influence in the equation

Each weighting tells you the contribution of that variable to the equation that predicts which group each case of data is in. Larger weightings mean that the variable is more influential and smaller weightings mean it is less influential, provided the variables are measured on comparable scales (for example, standardized).

Now that you know which variables influence the (HI, MED, LO) groups the most, you can focus on analyses that compare just those variables between groups and see if they differ significantly.

Plus, in the future you can use the equation variable weighting values to construct a formula that will immediately predict if they are a HI, LO, or MED responder just by collecting some survey data up-front when they are diagnosed and begin their treatment.

ozo Commented:
Randomly assigning who gets the drug could remove one source of systematic bias.
Evaluation of response should also be done without knowledge of who got the drug in order to remove another kind of systematic bias.
It is possible that the particular population on the study responds to the drug differently than the general population,
but a drug that works only on a particular sub population may still be useful, and if indications of effectiveness are found,
that may warrant further study in which other factors that may influence efficacy can be teased out more carefully.
There is a risk of not detecting a drug that is effective only on some people if you happen to test it only on people for whom it is not effective, but that risk is likely to be on the same order as the random variance in your experiment anyway,
and if the effect of the drug is too weak to detect above the selection bias in your study population,
it may be that other drugs may be more promising for treatment anyway, so this risk is probably not your greatest concern.
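
A rough sketch of the two ideas at the top of this comment, random assignment and blinded evaluation. This only applies if you control who gets the drug, which a volunteer registry by itself does not give you. The function name `randomize_blinded`, the patient IDs, and the `S001`-style codes are all hypothetical.

```python
import random

def randomize_blinded(patient_ids, seed=None):
    """Randomly split patients into two equal-sized arms and give each
    patient an opaque code that reveals nothing about the arm."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half of the shuffled list gets the drug, the rest are controls.
    arm = {pid: ("drug" if i < half else "control")
           for i, pid in enumerate(shuffled)}
    # Codes are shuffled independently of the arm assignment.
    codes = [f"S{k:03d}" for k in range(1, len(shuffled) + 1)]
    rng.shuffle(codes)
    blind_code = dict(zip(patient_ids, codes))
    return arm, blind_code

patients = [f"patient_{k}" for k in range(1, 11)]  # hypothetical IDs
arm, blind_code = randomize_blinded(patients, seed=42)

# The person evaluating response sees only the codes, never the arm labels.
evaluator_view = sorted(blind_code.values())
print(evaluator_view)
```

The `arm` table stays with whoever manages the study and is only joined back to the evaluations after all responses have been scored.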

titorober23 (Author) Commented:
But all analyses are done on all records in the database. This is a database where patients volunteer the information; we are trying to infer about all patients, basing the analysis on the records in the database.
You must answer some fundamental questions for yourself first; then provide your answers to the forum if you want an efficient solution:

1. You mention 'analysis'.... Specifically, what kind of inferential analysis are you wanting to do?
2.  Is your dependent variable of interest nominal, ordinal, interval or ratio?
3. What is your research question?
4. What is your hypothesis?

titorober23 (Author) Commented:
1.- Survival rate
Treatment efficacy
relation between diagnosis and survival
relation between type and survival
relation between dosage and survival

2.- dependent variable is interval
3.- many of 1.-
4.- based on 1.-
Patients who are actively sharing their medical history are not deceased----it appears that you do not mention in any of your posts if there are deceased persons in the database; this is essential information for 'survival rate' assessment.

Imagine calculating the 'survival rate' of animals rescued from the BP oil spill without any data on how many of the animals have died and not having any comparison data.

You have options---here are 3:
1) You will need to wait for a sufficient number of your database membership to succumb to their illness
2) You already have access to this data and you can assemble a comparison data set
3) Someone else has this data and you can get the data set from them

If there are deceased persons in your data set then you want to use Multivariate Logistic Regression. This analysis uses a binary dependent variable that is categorical; it will allow you to classify each case of data into two groups: 1 = survivor, 2 = non-survivor. You can use any/all types of variables with Logistic Regression as well.
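
As an illustration of what a logistic regression on a binary survivor/non-survivor outcome does, here is a minimal sketch in plain Python with a single predictor. Everything in it is an assumption for illustration: the data are made up, the predictor is a hypothetical stage at diagnosis, the outcome is coded 0/1 rather than 1/2, and the fit uses simple gradient ascent rather than what SPSS or SAS would do internally. Note that a model like this looks only at whether a patient died, not at how long each patient was followed.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent
    on the average log-likelihood. Returns (a, b)."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # derivative with respect to the intercept
            grad_b += (y - p) * x    # derivative with respect to the slope
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

# Hypothetical data: stage at diagnosis (1 to 4) and outcome (1 = deceased).
stage = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
died = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0]

a, b = fit_logistic(stage, died)

def predicted_risk(x):
    """Modelled probability of death for a patient with predictor value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

print(b, predicted_risk(1), predicted_risk(4))
```

A positive `b` here means the modelled risk rises with stage; with more predictors the same idea extends to one weight per variable.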
titorober23 (Author) Commented:
Unfortunately yes; most of the deceased patients in the database volunteered their information in the past and we kept it.
I have been using the Kaplan-Meier estimator.
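
For readers unfamiliar with it, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The follow-up times and event flags are made up, and the function name `kaplan_meier` is hypothetical; a real analysis would normally use a statistics package. The estimator handles patients who are still alive at last contact (censored), but it does not by itself correct for who chose to join the registry.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with a death.

    durations[i] is the follow-up time of patient i;
    events[i] is True if death was observed at that time,
    False if the patient was still alive at last contact (censored)."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group everyone whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Hypothetical follow-up in months since diagnosis.
durations = [5, 8, 8, 12, 15, 20]
events = [True, True, False, True, False, True]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

Running the same function separately per subgroup (for example by stage at diagnosis) gives one curve per group to compare.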
nickalh Commented:
How much do you need to trust the answer you're getting?

We're assuming this is not anywhere near the stage of needing FDA or government approval.
(In that case, you need to consult a qualified statistician directly, and ethically, we should stop helping.)

Is this simply a feasibility study?  Examining the data to see which drugs warrant greater investigation?
Which drugs should we consider spending more serious money on?

I'm not talking about confidence intervals (yet).

For the deceased patients, were the questionnaire and the volunteer legal waiver the same?
I would guess family members are more willing to share a deceased person's data, because it's not their personal information.

How many people in the database?  1,000?  10,000?  100,000?
For the larger numbers,
> Suppose you really have an age bias - most of your respondents are young.  Well, you can categorize your responses into age groups, then weight them so that their final influence on your data matches the distribution of ages in the general population.  This can be dangerous, because statistical errors increase wildly for small samples - if only two or three people over the age of 60 responded, their data will have an undue influence on your final conclusions.  This sort of procedure really only works when you have large enough samples to mitigate these issues.
becomes more effective, usable, and more reliable.

Has the data been gathered consistently?
Measuring from time of diagnosis to end of life, could be one consistent method.
But then the ideal would be to include a control for how far advanced the cancer was at diagnosis.  Is that a very subjective measure?  (Doctor's opinion)  Or an objective measure (size of the tumor in mm?)

And BTW, infer has only one 'e'.