How to calculate the needed sample size and confidence interval to validate data extracted from free-text reports?

Posted on 2009-02-11
Medium Priority
Last Modified: 2013-11-13
I need some help determining the sample size to validate the extraction of medication instances from free-text reports. I have something like the following (I'm making up the numbers, since another person is counting the reports, etc.).

I have 200 patients divided in four groups:
group 1: 50 patients exposed to a medication who did not have a side effect
group 2: 50 patients exposed to medication who had a side effect
group 3: 50 patients not exposed to medication who did not have a side effect
group 4: 50 patients not exposed to medication who had a side effect (caused by something else)

For these 200 patients I have (I'm making this number up) 15432 reports that have been processed with a Natural Language Processing tool. We need to validate the extraction process against a gold standard (an actual person checking the reports to see if the extraction was correct). The extraction process basically consists of identifying all instances of a given medication inside a report. So, for example, if I want to know whether there's a mention of aspirin in the report, the tool will extract it. There could be no mention, or the medication (in this case, aspirin) could appear one or more times in the document/report.

I think I can merge groups 1 and 2 (and likewise groups 3 and 4), since I'm interested in detecting medication exposure and not the presence of a side effect. So, this will give me 2 groups:
Group A: patients exposed to medication and
Group B: patients not exposed to medication - each group with 100 patients and their corresponding number of reports, adding up to 15432.

In order to do this, I need to:

1. determine whether I need to calculate my sample for validation at a medication instance level or at a document/report level, so I can calculate specificity, sensitivity, accuracy, and precision. However, I do not know the actual number of instances, since they are being picked by the tool, although I know the actual number of reports.

2. determine the number of reports or instances I need so that my results are statistically significant.
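
For reference, the four measures named in item 1 can be sketched from a 2x2 table of tool output versus manual review. This is only an illustration at the report level; the function name `validation_metrics` and all counts are made up. At the instance level a "true negative" is hard to define, so specificity is usually easier to state per report.

```python
def validation_metrics(tp, fp, fn, tn):
    """Compute standard validation measures from confusion-matrix counts.

    tp: tool found the medication and the reviewer agrees
    fp: tool found the medication but the reviewer disagrees
    fn: tool missed a medication the reviewer found
    tn: neither tool nor reviewer found the medication
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": (tp + tn) / total,
    }

# Hypothetical counts from a manual review of 200 reports
metrics = validation_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```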

I think I need the following to determine the sample size:
confidence level - I think that 95% is customary?
confidence interval <- I DO NOT KNOW HOW TO DETERMINE THIS
population size <- SHOULD THIS BE MY 15,432 reports or should it be another number?
What are the underlying assumptions? Is this a binomial distribution?

Any ideas??

Thanks so much!

Question by:garus

Expert Comment

ID: 23618447
There are a lot of questions here...

You seem to be talking about power analysis, but power analysis is unnecessary if you already have data.  If you don't already have statistical significance, you need more people.  If you do, you don't.  If you want to know how many more cases you should get to reach a specific power level (for example, an 80% chance to detect your effect), then that is a more appropriate question - is this what you want?

Confidence level is one minus the probability of a Type I Error that you are willing to accept.  So, if you are willing to accept a 5% chance of concluding that there is a difference between groups when there really is none, 95% confidence would be the result.  In medicine, 99% or 99.9% are more typical.  Social sciences usually use 95%.

Confidence intervals are computed differently depending on the test, but generally are of the form:
computed statistic +/- (standard error * critical value)
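
For a simple proportion (say, the fraction of reviewed reports where the extraction was correct), that general form can be sketched in Python as below. This is just the normal-approximation (Wald) interval, which is one of several options and behaves poorly for very small samples or proportions near 0 or 1. The function name `proportion_ci` and the counts are made up for illustration.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n                      # computed statistic
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error
    return p_hat - z * se, p_hat + z * se

# Hypothetical: 180 of 200 reviewed reports were extracted correctly
low, high = proportion_ci(180, 200)
print(f"95% CI: {low:.3f} to {high:.3f}")
```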

We refer to the population as the count of whatever our base level of comparison is, which is usually a theoretical value.  If you can collect all data from the entire population, there is no need for the use of inferential statistics (i.e. statistical significance testing).

Author Comment

ID: 23618526
Hi richdiesal,
Thanks for your reply. I'm trying to validate the output of the NLP processing. I need to know how many reports I have to pick and manually review to see whether the output of the NLP tool is correct, without going through all of them, but with a number large enough that I can say the NLP works correctly.

Maybe my question was too long. I was trying to explain the problem.

Accepted Solution

richdiesal earned 1500 total points
ID: 23618560
If you're just interested in double-checking the output of an automated tool (I'm not familiar with NLP in particular), there isn't really any sort of test to use.  It would just be a matter of what seems reasonable.  I know in my field, by-hand recodes of 25% of the data to check accuracy aren't uncommon.  But I imagine that varies depending on who is checking and what sounds reasonable to them.
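
If a number is wanted to go alongside the "what seems reasonable" judgment, one common rule of thumb is the sample size needed to estimate a proportion to within a chosen margin of error, with a finite population correction. The sketch below assumes simple random sampling of reports and a binomial (correct / not correct) outcome per report; the function name `sample_size_for_proportion` is made up, and the population of 15432 is the asker's made-up figure.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, expected_p=0.5,
                               population=None):
    """Reports to review so the estimated proportion is within +/- margin.

    expected_p=0.5 is the most conservative guess (largest sample).
    population=None means no finite population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * expected_p * (1 - expected_p) / margin ** 2
    if population is not None:
        # Finite population correction for sampling without replacement
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# +/- 5 percentage points at 95% confidence, from 15432 reports
n_reports = sample_size_for_proportion(0.05, population=15432)
print(n_reports)
```

A tighter margin or a higher confidence level raises the required number quickly, so the margin is the main choice to make here.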

