How to calculate the sample size and confidence interval needed to validate data extracted from free-text reports?

Posted on 2009-02-11
Last Modified: 2013-11-13
I need some help determining the sample size to validate the extraction of medication instances from free-text reports. The setup is something like this (I'm making up the numbers, since another person is counting the reports):

I have 200 patients divided in four groups:
group 1: 50 patients exposed to a medication who did not have a side effect
group 2: 50 patients exposed to medication who had a side effect
group 3: 50 patients not exposed to medication who did not have a side effect
group 4: 50 patients not exposed to medication who had a side effect (caused by something else)

For these 200 patients I have (I'm making this number up) 15432 reports that have been processed with a Natural Language Processing tool. We need to validate the extraction process against a gold standard (an actual person checking the reports to see if the extraction was correct). The extraction process basically consists of identifying all instances of a given medication inside a report. So, for example, if I want to know whether there's a mention of aspirin in the report, the tool will extract it. There could be no mention, or the medication (in this case, aspirin) could appear one or more times in the document/report.

I think I can merge groups 1 and 2 (and likewise groups 3 and 4), since I'm interested in detecting medication exposure and not the presence of a side effect. So, this will give me 2 groups:
Group A: patients exposed to medication and
Group B: patients not exposed to medication - each group with 100 patients and their corresponding number of reports, adding up to 15432.

In order to do this, I need to:

1. determine whether I need to calculate my sample for validation at a medication instance level or at a document/report level, so I can calculate specificity, sensitivity, accuracy and precision. However, I do not know the actual number of instances, since they are being picked by the tool, although I do know the actual number of reports.

2. determine the number of reports or instances I need so my results are statistically significant.
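
As a rough illustration of the metrics named in point 1, here is a minimal sketch of how they could be computed once a reviewer has checked a sample. The function name and the counts are made up for illustration and assume a document-level review, where each report is either a true/false positive or a true/false negative. At the instance level, true negatives are usually not well defined, so specificity and accuracy may not be meaningful there.

```python
def validation_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, precision and accuracy from review counts."""
    def ratio(num, den):
        # Avoid a ZeroDivisionError when a category is empty
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # share of real mentions the tool found
        "specificity": ratio(tn, tn + fp),   # share of mention-free reports left alone
        "precision": ratio(tp, tp + fp),     # share of tool hits that were correct
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical counts from a manual review of 400 reports
metrics = validation_metrics(tp=180, fp=10, fn=20, tn=190)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```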

I think I need the following to determine the sample size:
confidence level - I think that 95% is customary?
confidence interval <- I DO NOT KNOW HOW TO DETERMINE THIS
population size <- SHOULD THIS BE MY 15,432 reports or should it be another number?
What are the underlying assumptions? Is this a binomial distribution?
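
For reference, one common way to tie these inputs together is the normal-approximation sample size formula for a proportion with a finite population correction. The sketch below is only that, under stated assumptions: each reviewed report is treated as a binomial "correct / incorrect" outcome, reports are drawn by simple random sampling, the population is the 15,432 reports, and the margin of error (the half-width of the confidence interval) is something you choose, here 5 percentage points. The function name and defaults are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population, margin, confidence=0.95, expected_p=0.5):
    """Reports to review so a proportion is estimated within +/- margin."""
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size for an effectively infinite population; p = 0.5 is the most conservative guess
    n0 = z ** 2 * expected_p * (1 - expected_p) / margin ** 2
    # Finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


n_reports = sample_size(population=15432, margin=0.05)
print(n_reports)
```

Under these assumptions the result is 375 reports. A tighter margin or a rarer event (for example, estimating sensitivity, which depends only on the reports that truly mention the drug) would call for more.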

Any ideas??

Thanks so much!

Question by: garus

    Expert Comment

    There are a lot of questions here...

    You seem to be talking about power analysis, but power analysis is unnecessary if you already have data.  If you don't already have statistical significance, you need more people.  If you do, you don't.  If you want to know how many more cases you would need to reach a specific power level (for example, an 80% chance of detecting your effect), then that is a more appropriate question. Is this what you want?

    The confidence level is one minus the probability of a Type I error that you are willing to accept.  So, if you are willing to accept a 5% chance of concluding that there is a difference between groups when there really is none, the confidence level is 95%.  In medicine, 99% or 99.9% are more typical.  Social sciences usually use 95%.

    Confidence intervals are computed differently depending on the test, but generally are of the form:
    computed statistic +/- (standard error * critical value)
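
As one concrete instance of that general form, here is a minimal sketch of a normal-approximation (Wald) interval for a proportion, such as the share of reviewed reports where the extraction was correct. The counts are made up, and the Wald interval is just one choice; it behaves poorly when the proportion is close to 0 or 1 or the sample is small.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n                                 # computed statistic
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # critical value
    se = math.sqrt(p_hat * (1 - p_hat) / n)               # standard error
    # Clip to [0, 1] because a proportion cannot fall outside that range
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# Hypothetical review: 350 of 375 sampled reports were extracted correctly
low, high = wald_interval(successes=350, n=375)
print(f"{low:.3f} to {high:.3f}")
```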

    The population is the full set of whatever your base level of comparison is, and its size is usually a theoretical value.  If you can collect all data from the entire population, there is no need for inferential statistics (i.e., statistical significance testing).

    Author Comment

    Hi richdiesal,
    Thanks for your reply. I'm trying to validate the output of the NLP processing, so I need to know how many reports I should pick and manually review to see whether the output of the NLP tool is correct, without going through all of them. The number should be large enough that I can say the NLP works correctly.

    Maybe my question was too long. I was trying to explain the problem.

    Accepted Solution

    If you're just interested in double-checking the output of an automated tool (I'm not familiar with NLP in particular), there isn't really any sort of test to use.  It would just be a matter of what seems reasonable.  I know in my field, by-hand recodes of 25% of the data to check accuracy aren't uncommon.  But I imagine that varies depending on who is checking and what sounds reasonable to them.
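
Whatever fraction is chosen, the reports to re-check by hand should be picked at random rather than by convenience. A minimal sketch of drawing a reproducible 25% sample is below. The report IDs are placeholders (1 to 15432, matching the made-up count in the question), and the function name, fraction and seed are illustrative.

```python
import random


def draw_review_sample(report_ids, fraction=0.25, seed=0):
    """Pick a reproducible simple random sample of reports for manual review."""
    k = round(len(report_ids) * fraction)
    rng = random.Random(seed)          # fixed seed so the same sample can be redrawn
    return sorted(rng.sample(report_ids, k))


report_ids = list(range(1, 15433))     # placeholder IDs for 15432 reports
to_review = draw_review_sample(report_ids)
print(len(to_review))
```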
