Generate Data in T-SQL


I have two datasets and both are medical records . Now one contains records with the following columns:
1 Product
2 Diagnosis
3 Type of Doctor
4  No of Packets of Product prescribed

The secon dataset contains records with the following columns:
1 Products
2.Type of Doctor
3. No of Packets of Product Prescribed

So in the second dataset, I am missing diagnosis information. Therefore to have an idea of the spread of diagnosis information on the second dataset and to report on the second dataset, I am supposed to use information gathered (diagnosis probability information) generated from the first dataset to apply on the second dataset (i.e. a way of putting diagnosis information on the second dataset uisng T-SQL or any method).
The end will be to build an SSAS cube on the data. If there is a way to do this in a cube straight ahead, that will be a great idea as well.

 How can I do this using T-SQL and SSAS.

A clear step of how to do this with examples is highly welcome

I have attached the files containg the sample datasets here with the question.

Thanks for your ideas.

 First-Dataset.csv First-Dataset.csv
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

[] has to be used to enclosed columns with spaces.

table with the diagnosis will be a
table without will be b
select [a.Product code], a.Diagnosis, [a.type of Doctor] from tablea a join tableb b on ( [a.product code]=[b.product code] and [a.type of doctor]=[b.type of doctor] )

This should return a third dataset with the combination of the two i.e. trimming dataset 1 based on data in dataset 2.

SSAS cube info
What field the diagnosis is based on? I mean what says that this value x means diagnosis y .
The last column in the first data set is the Diagnosis.
It seems to me as the asker merely wants to merge the two datasets on the common between them and include the diagnosis which is only available in the first data set.

second dataset has many duplicates which is why the example queries the first dataset and joins on the second.
Protecting & Securing Your Critical Data

Considering 93 percent of companies file for bankruptcy within 12 months of a disaster that blocked access to their data for 10 days or more, planning for the worst is just smart business. Learn how Acronis Backup integrates security at every stage

marvo2010Author Commented:
Hello @TempDBA,

The values in the Diagnosis column reresents codes for various diagnosis. so take them to mean for e.g E88.9 mean malaria
k50.0  means Typhoid fever. so the codes represent various diseases diagnosed.


marvo2010Author Commented:
@arnold - The first dataset is to be used as a sample dataset that shows the percentage or the number of packs of different drugs prescribed for different diseases (Diagnosis) by different doctors  in typical period.
so what is required is a way to generate the probability of this combinations and apply it to the dataset without Diagnosis information.

Its not meant to be a direct table join.
I am looking for how this can be done
Many thanks
dataset one has products and for what diagnosis they were prescribed by doctors
dataset two is products and doctors who prescribed them

There is no information on which a "period" can be defined.

You want to build a probability
What is the likelyhood that doctorA will prescribe product 1 for diagnosis I?

What is the probability that doctor A will prescribe product 1?
You looking to see given a diagnosis and a doctors what is the probability that product A will be prescribed?
Given a diagnosis a what is the probability product 1 will be prescribed?

IS this the cube you want to build within SSAS?

The link posted provides an example of data needed to construct a cube.

marvo2010Author Commented:
Hello @arnold you definitely got the analysis we want to make right using a cube that is built with results from the datasets.
What I dont know is if using a direct link to the datasets as you have shown me will give a good mapping of diagnosis on to the second dataset.

Please can you show a little detail using the dataset.

I appreciate your ideas

marvo2010Author Commented:
Hello @arnold ,

Also there is no intention to trim the second dataset. In actual fact the intention is to use the diagnosis data in the first dataset to allocate the product used by the doctors to various diagnosis based on the use of the products in the first dataset by doctors.

I hope In have made it bit clearer

The example I provided will not lead to what you want.
the difficulty the second set has little to no value since one has to assume that the product/doctor combination can be attributed/equated to the product/doctor/diagnosis in the first set.
second set data

First set data

You can use the second set data to build
I've not built SSAS cubes, but probability analysis I can do.
using a single doctor/single product
you can build from the first set the distribution of prescription for a specific diagnosis.

i.e. per doctor sums
the total number of a product prescibed for a diagnosis per doctor / total number of products prescribed by a doctor
etc. this will deal with getting the data with the group by Doctor, diagnosis
select product,doctor, diagnosis, prescribed/sum(prescribed) group by Doctor, diagnosis


total prescribed per product

select product,  SUM(prescribed) from [tablename]  group by product

total prescribed by diagnosis/product

select diagnosis, product, sum (prescribed) from [tablename] group by diagnosis

total product prescribed by each doctor

select doctor, product, sum (prescribed) from [tablename] group by doctor

totals prescribed by each doctor per diagnosis

select doctor,diagnosis,product,  SUM(prescribed) from [tablename] group by doctor, diagnosis;

Don't know whether these could be inputs to a single analysis database to generate/build the cube.
marvo2010Author Commented:
This sounds good but I wish it was clearer

Any ideas of probability or statistical analysis that can be performed on the first dataset to be applied on the second? That will seem more applicable.
More information here : There are millions of rows of the second dataset compared to the first dataset.
So the first dataset is being used as a sample to constrain the second dataset with Diagnosis information which was not provided in the second dataset when it was gathered.

The second set is full of duplicates so
i.e. API, prescribed the same product X number of times.
It could be possible to see whether the prescription in this case is was prescribed for diagnosis A versus diagnosis Y, etc. references some Business inteligence application

Building SSAS cubes is not something I am well versed in to provide information at this time.
marvo2010Author Commented:
Hello @Arnold,

I am well versed with building cubes. Very good at that. Its only to make sure that I can put the diagnosis data on the second data set based on the distribution of that same diagnosis information in the first dataset.
Thats where the challenge is.

marvo2010Author Commented:
I just hope someone can show me what is very definitive on how I can do this mapping or put simply
use a process to use the distribution of diagnosis data in the first dataset to the second dataset.

still waiting.

Thanks for anu ideas.

The second set will likely need to rely on the cube built with the first data set as the input. The data in the first set can be used to build/compute the probability of doctor, diagnosis, prescribing a product.

doctor API probably prescribed product 359126 primarily for diagnosis K50.0
marvo2010Author Commented:
hello @arnold,

Thats exactly the lines I have been thinking about but exactly how to implement it by calculating the probabilities of each product being prescribed for a diagnosis is the challenge.
And actually developing various views which can be used to analyse the second dataset using the probabilities derived from the first dataset.

I am not sure if it can be done in the cube or if the data can be generated first before building the cube.

This is where the selects I've included in the prior responses can be used as views that calculate total prescriptions by product
total product prescriptions per diagnosis
total product prescriptions by a  doctor
total product presciprions by a doctor per diagnosis

ie. create view total_product_prescriptions ( product, prescription ) select product,sum(prescription) as prescription from [yourfirstdatasettable] group by product

repeat the same with the other select examples to create the other views.
marvo2010Author Commented:
Hello @Arnold,

with your last comment, I can see that you fully understand what I require to to. Its just how to get these probabilities in one table and how to apply it to analyse the Second dataset that lacks prescription data. I am very close but when it looks like the idea of how to do it is coming into my head, I just seem to start from scratch again. I am just looking for a step by step process of applying the probability to generate a complete dataset of the second dataset and build my cube with it. As per building the cube, I am very okay with it. I just need to get the data in order first.

I have to thank you first for all the effort so far and hope that we can finish it. I still welcome your ideas


The different queries can be used to build the views for the various scenarios.
I'm not sure what or how you want the data in a single table, and since I am unfamiliar with the SSAS and cube setup, not sure what additional information you need to complete your cube creation.

So far, the examples have
total prescription by product.
total prescription by doctor for each product
total prescription by product per diagnosis
total prescription by doctor of a product per diagnosis
You could have doctors and their diagnosis distribution deals with doctors who prescribe several products for the same diagnosis.

What is the type of data that you are missing to complete the cube construct??
marvo2010Author Commented:
Hello @Arnold,

I can generate these views from the first dataset. which is quite okay
Now the challenge is how to apply these derived  vlaues to the second dataset. This is where the issue lies; especially the ones with diagosis information.

The second set is an input that using the probability of the data from the first to suggest the probable diagnosis by the doctor for which this prescription is was written.

I referenced the example previously for doctor API and prescribed product 359126 which is 80% for K50.0 or 94% for the combination of k50.0 and k50.1.

The only way to populate the diagnosis in dataset2 is based on probability.
I.e. the total sum of prescribed items by doctor A for product B distributed based on the probability of the use of that product in diagnosis by the specific doctor or by the mention prescribed product based on diagnosis.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
marvo2010Author Commented:
Hello Arnold,

Your last comment is exactly where I am :
"The second set is an input that using the probability of the data from the first to suggest the probable diagnosis by the doctor for which this prescription is was written."

 Its a pain that you do not use SSAS. So how do you think I can populate the diagnosis in dataset2 based on probability ffrom dataset1. I dont have a clue how to make this work within SSAS because the two datasets will constitute two different cubes. But Now how do I calculate the probabilities of different products being prescribed for various diagnosis in dataset1 and then use it to apply in dataset2. I am so close but I dont know how to apply this to work.

do you have anyone that you can ask as well?


Unfortunately, not.

The second set is an input and can be evaluated in two ways against the probability cube
i.e. probability of a medication for a diagnosis
select product, sum(prescribed) from secondset group by product
the other is
select doctor, product, sum(prescribed) from secondset group by doctor

The first provides a distribution of the total prescribed product into the probably diagnosis while the second does the same with the differentiation based on the doctor who prescribed it.

It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server 2008

From novice to tech pro — start learning today.