How to calculate the needed sample size and confidence interval to validate data extracted from free-text reports?

garus asked
Medium Priority
Last Modified: 2013-11-13
I need some help determining the sample size needed to validate the extraction of medication instances from free-text reports. My setup looks something like this (I'm making up the numbers, since another person is counting the reports, etc.).

I have 200 patients divided into four groups:
group 1: 50 patients exposed to a medication who did not have a side effect
group 2: 50 patients exposed to medication who had a side effect
group 3: 50 patients not exposed to medication who did not have a side effect
group 4: 50 patients not exposed to medication who had a side effect (caused by something else)

For these 200 patients I have (I'm making this number up) 15432 reports that have been processed with a Natural Language Processing tool. We need to validate the extraction process against a gold standard (an actual person checking the reports to see whether the extraction was correct). The extraction process basically consists of identifying all instances of a given medication inside a report. So, for example, if I want to know whether there's a mention of aspirin in the report, the tool will extract it. There could be no mention, or the medication (in this case, aspirin) could appear one or more times in the document/report.

I think I can merge groups 1 and 2 (and likewise groups 3 and 4), since I'm interested in detecting medication exposure and not the presence of a side effect. This gives me 2 groups:
Group A: patients exposed to medication
Group B: patients not exposed to medication
Each group has 100 patients, and their corresponding reports add up to 15432.

In order to do this, I need to:

1. Determine whether I need to calculate my validation sample at the medication-instance level or at the document/report level, so I can calculate specificity, sensitivity, accuracy, and precision. However, I do not know the actual number of instances, since they are being picked by the tool, although I do know the actual number of reports.

2. Determine the number of reports or instances I need so that my results are statistically significant.
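
For reference, the four metrics named in item 1 can be sketched from confusion-matrix counts as below. This is only an illustration: the function name and the counts are hypothetical, and it assumes a document-level comparison (does the report mention the medication, yes or no), where true negatives are well defined. At the instance level there is usually no natural count of true negatives, so specificity and accuracy may not be computable there.

```python
def validation_metrics(tp, fp, fn, tn):
    """Compute standard validation metrics from confusion-matrix counts.

    tp: tool says "mentioned", reviewer agrees
    fp: tool says "mentioned", reviewer disagrees
    fn: tool says "not mentioned", reviewer found a mention
    tn: tool says "not mentioned", reviewer agrees
    Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts from a manual review of 200 reports.
metrics = validation_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```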

I think I need the following to determine the sample size:
Confidence level: I think that 95% is customary?
Confidence interval: I DO NOT KNOW HOW TO DETERMINE THIS
Population size: SHOULD THIS BE MY 15,432 reports, or should it be another number?
What are the underlying assumptions? Is this a binomial distribution?
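
One common way these ingredients fit together, if each reviewed item is treated as an independent correct/incorrect outcome (a binomial assumption) and items are drawn by simple random sampling, is the normal-approximation sample-size formula for a proportion with a finite population correction. The sketch below is not from this thread: the function name is hypothetical, the 5% margin of error is an arbitrary choice, and 15432 is the made-up report count from above.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95,
                               expected_p=0.5, population=None):
    """Items to review so the interval half-width is roughly `margin`.

    expected_p=0.5 is the most conservative guess (largest sample).
    If `population` is given, a finite population correction is applied.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * expected_p * (1 - expected_p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size_for_proportion(0.05))                    # 385
print(sample_size_for_proportion(0.05, population=15432))  # 375
```

With a population this large, the correction changes the answer only slightly. Reports from the same patient are probably not independent, so treat these numbers as a rough lower bound.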

Any ideas??

Thanks so much!


There are a lot of questions here...

You seem to be talking about power analysis, but power analysis is unnecessary if you already have data. If you don't already have statistical significance, you need more people. If you do, you don't. If you want to know how many more cases you would need to reach a specific power level (for example, an 80% chance of detecting your effect), then that is a more appropriate question. Is this what you want?

The confidence level is 1 minus the probability of a Type I error that you are willing to accept. So, if you are willing to accept a 5% chance of concluding that there is a difference between groups when there really is none, 95% confidence would be the result. Stricter levels such as 99% or 99.9% are sometimes used in medicine; the social sciences usually use 95%.

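Confidence intervals are computed differently depending on the test, but generally are of the form:
computed statistic +/- (critical value * standard error)
where the critical value comes from the test's reference distribution (about 1.96 for 95% confidence under a normal approximation).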
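
As an illustration of that general form for the case in the question, here is a minimal sketch of the normal-approximation (Wald) interval for a proportion, such as the share of reviewed reports where the tool's output was correct. The function name and the counts (190 correct out of 200 reviewed) are hypothetical. The Wald interval is known to behave poorly when the proportion is close to 0 or 1 or the sample is small, so a Wilson or exact interval may be a better choice in practice.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Wald interval: estimate +/- critical value * standard error."""
    p_hat = successes / n
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    half_width = z * standard_error
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical review: 190 of 200 sampled reports were extracted correctly.
low, high = proportion_confidence_interval(190, 200)
print(f"95% CI: {low:.3f} to {high:.3f}")
```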

The population is the full set of whatever your base unit of comparison is, and it is usually a theoretical quantity. If you can collect data from the entire population, there is no need for inferential statistics (i.e. statistical significance testing).


Hi richdiesal,
Thanks for your reply. I'm trying to validate the output of the NLP processing. I need to know how many reports to pick and manually review to see whether the NLP tool's output is correct, without going through all of them; the number should be large enough that I can say the NLP works correctly.

Maybe my question was too long. I was trying to explain the problem.

If you're just interested in double-checking the output of an automated tool (I'm not familiar with NLP in particular), there isn't really any sort of test to use. It would just be a matter of what seems reasonable. I know that in my field, by-hand recodes of 25% of the data to check accuracy aren't uncommon. But I imagine that varies depending on who is checking and what sounds reasonable to them.
