Question
9.0

How do I calculate the needed sample size and confidence interval to validate data extracted from free-text reports?

Asked by garus in Probability & Statistics

Tags: statistics question about sample size and confidence interval

I need some help determining the sample size to validate the extraction of medication instances from free-text reports. My setup looks something like this (I'm making up the numbers, since another person is counting the reports):

I have 200 patients divided into four groups:
Group 1: 50 patients exposed to a medication who did not have a side effect
Group 2: 50 patients exposed to a medication who had a side effect
Group 3: 50 patients not exposed to the medication who did not have a side effect
Group 4: 50 patients not exposed to the medication who had a side effect (caused by something else)

For these 200 patients I have 15432 reports (again, I'm making this number up) that have been processed with a Natural Language Processing tool. We need to validate the extraction process against a gold standard (an actual person checking the reports to see whether the extraction was correct). The extraction process basically consists of identifying all instances of a given medication inside a report. So, for example, if I want to know whether aspirin is mentioned in a report, the tool will extract it. There could be no mention, or the medication (in this case, aspirin) could appear one or more times in the report.

I think I can merge groups 1 and 2 (and likewise groups 3 and 4), since I'm interested in detecting medication exposure and not the presence of a side effect. This gives me 2 groups:
Group A: patients exposed to the medication, and
Group B: patients not exposed to the medication. Each group has 100 patients, and their reports add up to 15432 in total.

In order to do this, I need to:

1. Determine whether I need to calculate my validation sample at the medication-instance level or at the document/report level, so that I can calculate specificity, sensitivity, accuracy, and precision. However, I do not know the actual number of instances, since they are picked out by the tool, although I do know the actual number of reports.

2. Determine the number of reports or instances I need so that my results are statistically significant.
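
For reference, the four measures named in item 1 can all be derived from the counts obtained by comparing the tool's output with the gold standard. The following is only a minimal sketch, assuming validation at the document/report level, where each report is classified as true positive, false positive, true negative, or false negative. The function name and the example counts are made up for illustration.

```python
def validation_metrics(tp, fp, tn, fn):
    """Compute standard validation measures from confusion-matrix counts.

    tp: reports where both the tool and the reviewer found the medication
    fp: reports where only the tool found it
    tn: reports where neither found it
    fn: reports where only the reviewer found it
    """
    def ratio(numerator, denominator):
        # Return NaN instead of failing when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for 200 manually reviewed reports.
metrics = validation_metrics(tp=80, fp=10, tn=95, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

One consideration when choosing the level: at the instance level there is usually no natural count of true negatives (there is no fixed list of places where a medication could have been mentioned but was not), so specificity and accuracy are generally easier to define per report, while sensitivity and precision can be computed at either level.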

I think I need the following to determine the sample size:
Confidence level <- I think that 95% is customary?
Confidence interval <- I DO NOT KNOW HOW TO DETERMINE THIS
Population size <- SHOULD THIS BE MY 15,432 reports, or should it be another number?
What are the underlying assumptions? Is this a binomial distribution?
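
To make the inputs listed above concrete, here is a hedged sketch of one common textbook approach. It is not necessarily the method given in the accepted solution, which is not visible here. It assumes each sampled report is an independent correct/incorrect outcome (a binomial model with a normal approximation), that the "confidence interval" input means the desired margin of error (the half-width of the interval, which you choose yourself, e.g. ±5 percentage points), and that the population is the set of reports being sampled. The function and parameter names are made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence=0.95,
                expected_proportion=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin_of_error.

    expected_proportion=0.5 is the most conservative guess (largest sample).
    If population is given, a finite population correction is applied.
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_proportion
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    if population is None:
        return math.ceil(n0)
    # Finite population correction shrinks the requirement slightly.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


# Using the made-up report count from the question as the population.
print(sample_size(0.05, population=15432))
print(sample_size(0.05))
```

Under these assumptions the population size matters little once it is in the thousands; the margin of error and the expected proportion drive the result. Reports from the same patient are probably not fully independent, so this should be read as a starting point rather than a definitive answer.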


Any ideas??

Thanks so much!

02/11/09 06:00 PM, ID: 23618560, Accepted Solution


About this solution

Zone: Probability & Statistics
Solution Provided By: richdiesal
Participating Experts: 1
Solution Grade: B
 
02/11/09 05:28 PM, ID: 23618447, Expert Comment

02/11/09 05:52 PM, ID: 23618526, Author Comment
