Link to home
Start Free TrialLog in
Avatar of Nura111
Nura111

asked on

The consideration of chosing one table vs two mysql?

HI, I was wondering the general consideration when creating a new table in the db between choosing one table or two.

In my case I need to create table/tables that will store keywords from messages(there are two types of messages type a and b) and the number of
occurrences   of the word in each of the different messages type. also I will need indication for hoe many total words are in in each of the different messages type.

so my question is I can eithe create one table :
word,typeAocc,typeBocc

and than every time I will need to get the total words for each words It will go over the whole table

or two table
1 wordTypeA wordType B

and than I can just count the number of rows for each table which is probably stored somewhere

but I have 2 table

anyway Its might be a stupid question but Ive been wonder about this type of issues for a while

Thanks you

Avatar of merwetta1
merwetta1

I believe you want the one table approach. You would have the table indexed by the word, then whenever you needed to update the count, you could do that with something like this:

update table_name set TypeAocc = TypeAocc + 5 where word = 'theword'

The 2 table approach would force you to do separate queries on the tables when it came time to retrieve values...a JOIN wouldn't work, because there's no guarantee both tables would have all the same words. So the one table approach seems better for your needs.
I think this particular data modeling decision can be better informed by how you intend to use that data. Since it looks like your implementing a intelligence tool (something that aggregates information that you use to make decisions about other things) the typical rules about normalization go out the window (since by counting something from another record and storing it elsewhere, you have already de-normalized the data), and you want to focus primarily on what information you are interested in, and what is the most efficient way to retrieve it.

So, what is the high-level question are you trying to answer with this information? The low-level question is, of course, what you have already indicated - how many times did word XYZ appear in message type A and message type B. A high-level question is the the one that lead you to start collecting this data, something like "Do people mention 'toilet paper' more on AIM or on Skype?"
Avatar of Nura111

ASKER

merwetta1:the problem was when i want to get the total number of words for each type.
So I will have to do something like select count(*) where typeAocc > 0

And I have to do that a lot and I thought that every time it will have to go throw all the table to check that and that its not efficient (but I dont understand a lot in that subject so im just assuming)

AngryBinary: basically what im doing is storing keyword from spam and non spam emails. and than every time I get an email I'm performing a Bayesian spam filtering so I will check the probability for each keyword from spam and non spam email so I will need  the TypeAocc and TypeBocc for each keyword.
And now I relized that I need spmehow to store the number of emails I checked (I thought It will be enoufh the number of keywords but its not)

so to add to my question how can I store that as well because it doesnt make sense to add this field to every row right? than everytime I change that I will need ot change it for every row
ASKER CERTIFIED SOLUTION
Avatar of AngryBinary
AngryBinary

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Nura111

ASKER

ok now im confused I already have the emails in an exsiting table with other inof but I dont want a realition to an email beside the number emails counted for every group spam/non spam)
The model does give you the number of emails of a particular type that a word appears in:

select count(1)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'

To get the frequency of the word in spam emails (the average number of times it appears in each email where it does appear), it would be:

select avg(occurance)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'

Basically, you get more information from your data at minimal additional cost, the cost being whatever is required to capture the number of occurrences of each word in each email when you build your data set. If you don't want it, then leave that out.
Avatar of Nura111

ASKER

okay Thank you but if I don't want the relation to the emails is it make sense to do one table

keywords:
word,spampcc,nonspamocc,totalspamemail,totalnonspamemail

but the totals will be the same number each row.

The thing is I have the word count I want to insert the data inot the db and than update it for every email

It doesnt make sense?




No, it doesn't make sense to put all the keywords in the same column as the email.

In order to search a large number of records, your database will have to do a full table scan and a string search through every single record.

If the words are kept in a separate table, one word per row in an indexed column, searching will be much more efficient - queries will run many times faster, in fact.

I don't understand the second part of your question - I think you're typing too fast.
Avatar of Nura111

ASKER

<<No, it doesn't make sense to put all the keywords in the same column as the email.  
which email? Im not saving the email just a number of the total emails that were used

what I mean I don't have the information for every keyword from each email its retrieved.

So I cant do it the way you suggested considering that what is the best way to your opinion?

Thank you.
Avatar of Nura111

ASKER

and their is still one word per row  in the one table I suggested
When I talk about "the email" in the database, I'm not necessarily talking about the message. I just mean a record in the database that identifies that an email exists somewhere with an arbitrary ID, so that we can use that ID to relate it to other things. Identifying individual emails provides a higher degree of granularity to the data so that the data can be sliced different ways to provide different sorts of metrics.

You seem to like the idea of storing the data in a single table with the columns:

     word | count A | count B

... so, why not do that? The model I presented is how I would have approached the problem; I work a great deal in business intelligence and reporting, and that model provides the most possible contexts for analyzing the data.

But, if it's just the single context that you are interested in, and you are reasonably certain that it's the only one you'll ever need from the data, then the single table approach works, especially if it's the one that most closely aligns with your understanding of the information that you're storing.
Avatar of Nura111

ASKER

yes but the problem is its need to be  word | count A | count B |totalA |totalb

and its just seem stupid to me now (that totalA totalb are the same number in each row and I will need to update them in the all table everytime they r changing
TotalA and totalB are what exactly? If they change based on the data, they are calculated values and shouldn't be stored, they should be calculated when you need them. If its a sum of all counts, then you can simply ask the database for the sum of all the numbers in the "count" column.
Avatar of Nura111

ASKER

Are the total number of the emails that was checked for the keyword. I cant calculate after I need to save it when the keyword is found.
Avatar of Nura111

ASKER

Ok so what do you think about one table as you suggested
Emails
------------------
email_id
email_type ('a' or 'b')
email_text (maybe you don't need this)
keywords (1/0) indicate the email is checked for keywords

KeywordsInEmails
------------------------
word
occuranceA
occuranceB

------------------------
and than If I want to see the total number of email checked for keyword  im checking  select count(*) from Emails where type = a AND keywords =1.

but I techenically dont have any relation between emails table and keywords

Is that make sense?


If that's the only information you are interested in, then I guess it makes sense. But, it seems like you are arbitrarily limiting the information that you can derive from your data by failing to maintain a relationship between the email and the keyword.
Avatar of Nura111

ASKER

ok I dont see any reason why I want to know from which email id the keyword came from (beside if its spam or not) Do you?
and in the way that im initaily counting the words im counting it as it was one text (the whole emails together)
so I dont have that inofrmation