I need some help. I need to determine the sample size to validate the extraction of medication instances from free text reports. I have something like (I'm making up the numbers since another person is counting the reports, etc).
I have 200 patients divided in four groups:
group 1: 50 patients exposed to a medication who did not have a side effect
group 2: 50 patients exposed to medication who had a side effect
group 3: 50 patients not exposed to medication who did not have a side effect
group 4: 50 patients not exposed to medication who had a side effect (caused by something else)
for these 200 patients I have (I'm making this number up) 15432 reports that have been processed with an Natural Language Processing tool. Since we need to validate the extraction process against a gold standard (actual person checking the reports to see if extraction was correct). The extraction process consists of basically identifying all instances of a given medication inside a report. So, for example, if I want to know if there's a mention of aspirin in the report, the tool will extract it. There could be no mention, or the medication (in this case, aspirin) could appear one or more times in the document/report.
I think I can merge groups 1 and 2 since I'm interested in detecting medication exposure and not the presence of a side effect. So, this will give me 2 groups:
Group A: pts exposed to medication and
Group B: patients not exposed to medication - each group with 100 pts and their corresponding number of reports adding up to 15432.
In order to do this, I need to:
1. determine whether I need to calculate my sample for validation at a medication instance level or at a document/report level so I can calculate specificity, sensitivity, accuracy, precision. However, I do not know the actual number of instances -since they are being picked by the tool, although I know the actual number of reports.
2. determine the number of reports or instances I need so my results are statistically significant.
I think I need the following to determine the sample size:
confidence level - I think that 95% is customary??
confidence interval <- I DO NOT KNOW HOW TO DETERMINE THIS
Population size <_ SHOULD THIS BE MY 15,432 reports or should it be another number?
What are the underlying assumptions? Is this a binomial distribution?
Any ideas??
Thanks so much!