garus

How do I calculate the sample size and confidence interval needed to validate data extracted from free-text reports?

I need some help determining the sample size required to validate the extraction of medication instances from free-text reports. My setup looks something like this (I'm making up the numbers, since another person is counting the reports):

I have 200 patients divided in four groups:
group 1: 50 patients exposed to a medication who did not have a side effect
group 2: 50 patients exposed to medication who had a side effect
group 3: 50 patients not exposed to medication who did not have a side effect
group 4: 50 patients not exposed to medication who had a side effect (caused by something else)

For these 200 patients I have (I'm making this number up) 15,432 reports that have been processed with a Natural Language Processing (NLP) tool. We need to validate the extraction process against a gold standard (a person checking the reports to see whether the extraction was correct). The extraction process basically consists of identifying all instances of a given medication inside a report. So, for example, if I want to know whether aspirin is mentioned in a report, the tool will extract each mention. There could be no mention, or the medication (in this case, aspirin) could appear one or more times in the report.

I think I can merge groups 1 and 2, and likewise groups 3 and 4, since I'm interested in detecting medication exposure and not the presence of a side effect. This gives me 2 groups:
Group A: patients exposed to the medication
Group B: patients not exposed to the medication
Each group has 100 patients, and their reports together add up to 15,432.

In order to do this, I need to:

1. Determine whether to draw my validation sample at the medication-instance level or at the document/report level, so that I can calculate specificity, sensitivity, accuracy, and precision. However, I do not know the actual number of instances, since they are being picked out by the tool, although I do know the actual number of reports.

2. Determine the number of reports or instances I need to review so that my results are statistically significant.
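For reference, the four measures named in item 1 can be sketched from the counts a manual review would produce. This is only an illustration: the function name and the counts below are hypothetical, not taken from the actual reports, and the same arithmetic applies whether the unit being counted is a report or a medication instance.

```python
def validation_metrics(tp, fp, tn, fn):
    """Compute standard validation measures from gold-standard review counts.

    tp: tool flagged the medication and the reviewer agreed
    fp: tool flagged the medication but the reviewer found none
    tn: tool flagged nothing and the reviewer found none
    fn: tool flagged nothing but the reviewer found a mention
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of real mentions the tool found
        "specificity": tn / (tn + fp),   # share of non-mentions left alone
        "precision": tp / (tp + fp),     # share of tool hits that were correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, for illustration only.
metrics = validation_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity needs true negatives, which are easy to count at the report level (a report either mentions the medication or not) but hard to define at the instance level. The sketch does not guard against zero denominators.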

I think I need the following to determine the sample size:
Confidence level: I think 95% is customary?
Confidence interval: I DO NOT KNOW HOW TO DETERMINE THIS
Population size: should this be my 15,432 reports, or should it be another number?
What are the underlying assumptions? Is this a binomial distribution?


Any ideas??

Thanks so much!

richdiesal

There are a lot of questions here...

You seem to be talking about power analysis, but power analysis is unnecessary if you already have data. If you don't already have statistical significance, you need more people. If you do, you don't. If you want to know how many more cases you should collect to reach a specific power level (for example, an 80% chance of detecting your effect), then that is a more appropriate question. Is this what you want?

Confidence level is one minus the probability of a Type I error that you are willing to accept. So, if you are willing to accept a 5% chance of concluding that there is a difference between groups when there really is none, you would use 95% confidence. 95% is the usual default in most fields; stricter levels such as 99% or 99.9% are sometimes used when a false positive would be especially costly.

Confidence intervals are computed differently depending on the test, but generally are of the form:
computed statistic +/- (critical value * standard error)
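As a rough sketch of that general form applied to this kind of problem, here is the normal-approximation (Wald) interval for a proportion, such as the share of reviewed reports the tool handled correctly. It assumes each reviewed report is an independent correct/incorrect trial (a binomial model), and the counts used are hypothetical. The Wald interval is only one of several options and behaves poorly when the proportion is near 0 or 1 or the sample is small.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n                      # computed statistic
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)         # standard error of p_hat
    # Clip to [0, 1] because a proportion cannot fall outside that range.
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# Hypothetical review: 180 of 200 sampled reports extracted correctly.
low, high = wald_interval(successes=180, n=200)
print(f"95% CI: {low:.3f} to {high:.3f}")
```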

The population is the full set of whatever units you are comparing, and it is usually treated as a theoretical (effectively infinite) quantity. If you can collect data from the entire population, there is no need for inferential statistics (i.e. statistical significance testing).
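When the population is a known, finite set, a common textbook approach to sizing a sample is to pick a margin of error for an estimated proportion and apply a finite population correction. The sketch below assumes simple random sampling of units and uses p = 0.5 as the most conservative guess for the unknown proportion; the function name and the 5% margin are illustrative assumptions, and 15,432 is the made-up report count from the question.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin.

    p=0.5 maximizes p * (1 - p), so it gives the largest (safest) size.
    If population is given, a finite population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2     # infinite-population size
    if population is not None:
        n = n / (1 + (n - 1) / population)     # finite population correction
    return ceil(n)


print(sample_size_for_proportion(0.05))                    # no correction
print(sample_size_for_proportion(0.05, population=15432))  # with correction
```

With a population this large the correction changes the answer only slightly. A tighter margin of error raises the required size quickly, since it enters the formula squared.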
garus

ASKER

Hi richdiesal,
Thanks for your reply. I'm trying to validate the output of the NLP processing. I need to know how many reports to pick and review manually to check whether the output of the NLP tool is correct. I don't want to go through all of them, but the number has to be large enough that I can say the NLP tool works correctly.

Maybe my question was too long. I was trying to explain the problem.
ASKER CERTIFIED SOLUTION
richdiesal
garus

ASKER

Thanks!