titorober23

statistics: sampling from database

Hi,
I have a patient registry: a database where patients with a specific cancer volunteer to share their medical history.
I am doing statistics based on this database. Now:
Is this a biased sample? Why?
How can I minimize the bias of inferring from this database rather than from a regular population sample?

please advise

thanks
David L. Hansen

It would only be a biased sample if the volunteers knew what kind of statistical analysis you were going to do and then were able to create/alter/update their medical history to "match" the analysis.
titorober23

ASKER

I think I disagree. Because this is a volunteer patient registry, the people who volunteer are most likely somehow more tech savvy than the others, so the database recordset is not a randomly chosen sample of the population.

please comment
Your question has to do with bias. There are a number of things that can sour a statistical study; bias is one of them. Your worry is that being tech savvy will isolate a group of people. True, but that just gives you a smaller population. It would be a problem if you were conducting a social or age-dependent study. You are not: the cancer will not have any tendency to afflict tech-savvy individuals more or less than any other group of people (unless you are looking at cancer of the eye or something like that). Again, bias will only be a factor if the individuals volunteering their information can alter that information based on their knowledge of what you are investigating.
Bias can be a factor if individuals with certain information are more likely to volunteer than individuals with other information.
Yes, ozo has it. If somehow tech-savvy individuals with cancer have certain medical information that is consistently and persistently different from that of all other people with cancer, then you have a problem. Can you think of why that might be? I cannot.
Certain types of medical information could be correlated with education, which could be correlated with tech-savvy.
Certain types of tech may be associated with certain kinds of cancer.
Certain types of medical information could be embarrassing or personal and less likely to be volunteered.
Certain types of people may be more likely to have their cancer diagnosed than others.
There may be many ways that we have not thought of that the sample may differ from a regular population.
Certain types of statistics may be more or less sensitive to these effects than other types of statistics.
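To make the point above concrete, here is a small simulation sketch in Python. Every number in it is invented for illustration and none comes from the actual registry: it assumes a hypothetical population where survival is longer on average for more educated patients, and where more educated patients are also more likely to volunteer. Under those assumptions the registry mean drifts away from the population mean even though nobody altered their records.

```python
import random

def simulate(n=100_000, seed=1):
    """Compare the mean outcome in a full population with the mean among volunteers."""
    rng = random.Random(seed)
    population, volunteers = [], []
    for _ in range(n):
        # Hypothetical: half of all patients are "educated".
        educated = rng.random() < 0.5
        # Hypothetical outcome: survival in months, higher on average if educated.
        months = rng.gauss(30 if educated else 24, 6)
        # Hypothetical volunteering rates: 40% if educated, 10% otherwise.
        joins = rng.random() < (0.4 if educated else 0.1)
        population.append(months)
        if joins:
            volunteers.append(months)
    pop_mean = sum(population) / len(population)
    vol_mean = sum(volunteers) / len(volunteers)
    return pop_mean, vol_mean

pop_mean, vol_mean = simulate()
print(f"population mean: {pop_mean:.1f} months, registry mean: {vol_mean:.1f} months")
```

With these made-up rates the expected population mean is 27 months and the expected registry mean is 28.8 months. The gap comes entirely from who volunteers, which is the mechanism ozo describes.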
Another issue could be that my database has more young people. Could this be a bias, since, as we know, young people tend to use computers more than older people?
please comment
If you are doing statistics that depend on age, that could be another bias.
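Nobody in the thread proposes a specific fix for the age imbalance, so the following is only one possible approach, not something the posters recommended: post-stratification, where each age group's mean is reweighted by that group's share in the target population. The sketch assumes you know the true age split of all patients with this cancer (for example from a cancer statistics office); the group names, values, and shares below are hypothetical.

```python
from collections import defaultdict

def poststratified_mean(records, population_share):
    """Reweight group means by known population shares.

    records          : list of (age_group, value) pairs from the registry
    population_share : dict mapping age_group -> share in the target population
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    missing = [g for g in population_share if counts[g] == 0]
    if missing:
        raise ValueError(f"no registry records for age groups: {missing}")
    # Weight each group's mean by its share in the target population.
    return sum(population_share[g] * sums[g] / counts[g] for g in population_share)

# Hypothetical registry in which young patients are over-represented.
records = [("under_50", 30.0)] * 6 + [("50_plus", 20.0)] * 2
# Hypothetical known split in the full patient population.
share = {"under_50": 0.25, "50_plus": 0.75}

naive = sum(value for _, value in records) / len(records)
adjusted = poststratified_mean(records, share)
print(naive, adjusted)
```

This only corrects for imbalance in variables you have measured and whose population distribution you know. It does nothing for differences that are not recorded in the registry.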
ASKER CERTIFIED SOLUTION
QCD
> it strongly depends on exactly what pieces of information you're trying to gather,
So what statistics are you doing based on this database?
What questions are you interested in inferring from this database?
Any sample drawn from this database will likely be biased because not all cancers occur with the same regular frequency; those with more common cancers will be more likely to be chosen even if you randomly sample the database.

In addition, a sample can be biased even if a single cancer is isolated and the sample is drawn only from those patients. This is because not all patients with the same type of cancer are prescribed the same treatment, so the sample may be biased by type of treatment.

Bottom line: What's your hypothesis? Without deriving hypotheses it is not possible to determine where the threats of bias will derive from; thus you do not know how to select a sample so that you distribute any extraneous variability across your sample as equally as possible.
My main goal is to infer about survival and how well patients respond to a drug, but those questions do not depend on age or on whether patients are tech savvy.
I want to know how long before they progress. All these questions depend on the diagnosis and type of cancer.

please comment
The records in this database are from patients with one type of cancer; in other words, this is a database of patients with the same cancer.
But all analyses are done on all records in the database. This is a database where patients volunteer the information; we are trying to infer about all patients, basing the analysis on the records in the database.
You must answer some fundamental questions for yourself first; then provide your answers to the forum if you want an efficient solution:

1. You mention 'analysis'.... Specifically, what kind of inferential analysis are you wanting to do?
2.  Is your dependent variable of interest nominal, ordinal, interval or ratio?
3. What is your research question?
4. What is your hypothesis?


1.- Survival rate
Treatment efficacy
relation between diagnosis and survival
relation between type and survival
relation between dosage and survival

2.- dependent variable is interval
3.- many of 1.-
4.- based on 1.-
Patients who are actively sharing their medical history are not deceased. You do not mention in any of your posts whether there are deceased persons in the database; this is essential information for 'survival rate' assessment.

Imagine calculating the 'survival rate' of animals rescued from the BP oil spill without any data on how many of the animals have died and not having any comparison data.

You have options; here are 3:
1) You will need to wait for a sufficient number of your database membership to succumb to their illness
2) You already have access to this data and you can assemble a comparison data set
3) Someone else has this data and you can get the data set from them

If there are deceased persons in your data set then you want to use Multivariate Logistic Regression. This analysis uses a binary, categorical dependent variable; it will allow you to classify each case into one of two groups: 1 = survivor, 2 = non-survivor. You can use any/all types of variables as predictors with Logistic Regression as well.
Unfortunately yes. Most of the deceased patients in the database volunteered their information in the past and we kept it.
I have been using the Kaplan-Meier estimator.
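For readers following along, here is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. It is not the asker's code; the follow-up times and event flags are hypothetical, and the function name `kaplan_meier` is made up for this example. Patients who are still alive at last contact are treated as censored (event = 0), which is what lets the estimator use both deceased and living registry members.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  : follow-up time per patient
    events : 1 if death (or progression) was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: fraction of those at risk who survive past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve

# Hypothetical follow-up in months; 0 means alive at last contact (censored).
times = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"month {t}: S(t) = {s:.3f}")
```

The estimator handles censoring, but it does not by itself correct for who ends up in the registry, so the sampling concerns raised earlier in the thread still apply to the resulting curve.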