Nura111
asked on
The consideration of chosing one table vs two mysql?
HI, I was wondering the general consideration when creating a new table in the db between choosing one table or two.
In my case I need to create table/tables that will store keywords from messages(there are two types of messages type a and b) and the number of
occurrences of the word in each of the different messages type. also I will need indication for hoe many total words are in in each of the different messages type.
so my question is I can eithe create one table :
word,typeAocc,typeBocc
and than every time I will need to get the total words for each words It will go over the whole table
or two table
1 wordTypeA wordType B
and than I can just count the number of rows for each table which is probably stored somewhere
but I have 2 table
anyway Its might be a stupid question but Ive been wonder about this type of issues for a while
Thanks you
In my case I need to create table/tables that will store keywords from messages(there are two types of messages type a and b) and the number of
occurrences of the word in each of the different messages type. also I will need indication for hoe many total words are in in each of the different messages type.
so my question is I can eithe create one table :
word,typeAocc,typeBocc
and than every time I will need to get the total words for each words It will go over the whole table
or two table
1 wordTypeA wordType B
and than I can just count the number of rows for each table which is probably stored somewhere
but I have 2 table
anyway Its might be a stupid question but Ive been wonder about this type of issues for a while
Thanks you
I think this particular data modeling decision can be better informed by how you intend to use that data. Since it looks like your implementing a intelligence tool (something that aggregates information that you use to make decisions about other things) the typical rules about normalization go out the window (since by counting something from another record and storing it elsewhere, you have already de-normalized the data), and you want to focus primarily on what information you are interested in, and what is the most efficient way to retrieve it.
So, what is the high-level question are you trying to answer with this information? The low-level question is, of course, what you have already indicated - how many times did word XYZ appear in message type A and message type B. A high-level question is the the one that lead you to start collecting this data, something like "Do people mention 'toilet paper' more on AIM or on Skype?"
So, what is the high-level question are you trying to answer with this information? The low-level question is, of course, what you have already indicated - how many times did word XYZ appear in message type A and message type B. A high-level question is the the one that lead you to start collecting this data, something like "Do people mention 'toilet paper' more on AIM or on Skype?"
ASKER
merwetta1:the problem was when i want to get the total number of words for each type.
So I will have to do something like select count(*) where typeAocc > 0
And I have to do that a lot and I thought that every time it will have to go throw all the table to check that and that its not efficient (but I dont understand a lot in that subject so im just assuming)
AngryBinary: basically what im doing is storing keyword from spam and non spam emails. and than every time I get an email I'm performing a Bayesian spam filtering so I will check the probability for each keyword from spam and non spam email so I will need the TypeAocc and TypeBocc for each keyword.
And now I relized that I need spmehow to store the number of emails I checked (I thought It will be enoufh the number of keywords but its not)
so to add to my question how can I store that as well because it doesnt make sense to add this field to every row right? than everytime I change that I will need ot change it for every row
So I will have to do something like select count(*) where typeAocc > 0
And I have to do that a lot and I thought that every time it will have to go throw all the table to check that and that its not efficient (but I dont understand a lot in that subject so im just assuming)
AngryBinary: basically what im doing is storing keyword from spam and non spam emails. and than every time I get an email I'm performing a Bayesian spam filtering so I will check the probability for each keyword from spam and non spam email so I will need the TypeAocc and TypeBocc for each keyword.
And now I relized that I need spmehow to store the number of emails I checked (I thought It will be enoufh the number of keywords but its not)
so to add to my question how can I store that as well because it doesnt make sense to add this field to every row right? than everytime I change that I will need ot change it for every row
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
ok now im confused I already have the emails in an exsiting table with other inof but I dont want a realition to an email beside the number emails counted for every group spam/non spam)
The model does give you the number of emails of a particular type that a word appears in:
select count(1)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'
To get the frequency of the word in spam emails (the average number of times it appears in each email where it does appear), it would be:
select avg(occurance)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'
Basically, you get more information from your data at minimal additional cost, the cost being whatever is required to capture the number of occurrences of each word in each email when you build your data set. If you don't want it, then leave that out.
select count(1)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'
To get the frequency of the word in spam emails (the average number of times it appears in each email where it does appear), it would be:
select avg(occurance)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'
Basically, you get more information from your data at minimal additional cost, the cost being whatever is required to capture the number of occurrences of each word in each email when you build your data set. If you don't want it, then leave that out.
ASKER
okay Thank you but if I don't want the relation to the emails is it make sense to do one table
keywords:
word,spampcc,nonspamocc,to talspamema il,totalno nspamemail
but the totals will be the same number each row.
The thing is I have the word count I want to insert the data inot the db and than update it for every email
It doesnt make sense?
keywords:
word,spampcc,nonspamocc,to
but the totals will be the same number each row.
The thing is I have the word count I want to insert the data inot the db and than update it for every email
It doesnt make sense?
No, it doesn't make sense to put all the keywords in the same column as the email.
In order to search a large number of records, your database will have to do a full table scan and a string search through every single record.
If the words are kept in a separate table, one word per row in an indexed column, searching will be much more efficient - queries will run many times faster, in fact.
I don't understand the second part of your question - I think you're typing too fast.
In order to search a large number of records, your database will have to do a full table scan and a string search through every single record.
If the words are kept in a separate table, one word per row in an indexed column, searching will be much more efficient - queries will run many times faster, in fact.
I don't understand the second part of your question - I think you're typing too fast.
ASKER
<<No, it doesn't make sense to put all the keywords in the same column as the email.
which email? Im not saving the email just a number of the total emails that were used
what I mean I don't have the information for every keyword from each email its retrieved.
So I cant do it the way you suggested considering that what is the best way to your opinion?
Thank you.
which email? Im not saving the email just a number of the total emails that were used
what I mean I don't have the information for every keyword from each email its retrieved.
So I cant do it the way you suggested considering that what is the best way to your opinion?
Thank you.
ASKER
and their is still one word per row in the one table I suggested
When I talk about "the email" in the database, I'm not necessarily talking about the message. I just mean a record in the database that identifies that an email exists somewhere with an arbitrary ID, so that we can use that ID to relate it to other things. Identifying individual emails provides a higher degree of granularity to the data so that the data can be sliced different ways to provide different sorts of metrics.
You seem to like the idea of storing the data in a single table with the columns:
word | count A | count B
... so, why not do that? The model I presented is how I would have approached the problem; I work a great deal in business intelligence and reporting, and that model provides the most possible contexts for analyzing the data.
But, if it's just the single context that you are interested in, and you are reasonably certain that it's the only one you'll ever need from the data, then the single table approach works, especially if it's the one that most closely aligns with your understanding of the information that you're storing.
You seem to like the idea of storing the data in a single table with the columns:
word | count A | count B
... so, why not do that? The model I presented is how I would have approached the problem; I work a great deal in business intelligence and reporting, and that model provides the most possible contexts for analyzing the data.
But, if it's just the single context that you are interested in, and you are reasonably certain that it's the only one you'll ever need from the data, then the single table approach works, especially if it's the one that most closely aligns with your understanding of the information that you're storing.
ASKER
yes but the problem is its need to be word | count A | count B |totalA |totalb
and its just seem stupid to me now (that totalA totalb are the same number in each row and I will need to update them in the all table everytime they r changing
and its just seem stupid to me now (that totalA totalb are the same number in each row and I will need to update them in the all table everytime they r changing
TotalA and totalB are what exactly? If they change based on the data, they are calculated values and shouldn't be stored, they should be calculated when you need them. If its a sum of all counts, then you can simply ask the database for the sum of all the numbers in the "count" column.
ASKER
Are the total number of the emails that was checked for the keyword. I cant calculate after I need to save it when the keyword is found.
ASKER
Ok so what do you think about one table as you suggested
Emails
------------------
email_id
email_type ('a' or 'b')
email_text (maybe you don't need this)
keywords (1/0) indicate the email is checked for keywords
KeywordsInEmails
------------------------
word
occuranceA
occuranceB
------------------------
and than If I want to see the total number of email checked for keyword im checking select count(*) from Emails where type = a AND keywords =1.
but I techenically dont have any relation between emails table and keywords
Is that make sense?
Emails
------------------
email_id
email_type ('a' or 'b')
email_text (maybe you don't need this)
keywords (1/0) indicate the email is checked for keywords
KeywordsInEmails
------------------------
word
occuranceA
occuranceB
------------------------
and than If I want to see the total number of email checked for keyword im checking select count(*) from Emails where type = a AND keywords =1.
but I techenically dont have any relation between emails table and keywords
Is that make sense?
If that's the only information you are interested in, then I guess it makes sense. But, it seems like you are arbitrarily limiting the information that you can derive from your data by failing to maintain a relationship between the email and the keyword.
ASKER
ok I dont see any reason why I want to know from which email id the keyword came from (beside if its spam or not) Do you?
and in the way that im initaily counting the words im counting it as it was one text (the whole emails together)
so I dont have that inofrmation
and in the way that im initaily counting the words im counting it as it was one text (the whole emails together)
so I dont have that inofrmation
update table_name set TypeAocc = TypeAocc + 5 where word = 'theword'
The 2 table approach would force you to do separate queries on the tables when it came time to retrieve values...a JOIN wouldn't work, because there's no guarantee both tables would have all the same words. So the one table approach seems better for your needs.