Is my table structure correct?

So I'm trying to work out the best table structure and how to interface with the data...

I want to store in a database a list of urls and "tag" them with multiple pre-defined keywords for me to search on later.  The tags on the urls may need to be updated from time to time. ie it's not a one off tagging but an on-going review of each URL.  More tags may be added in the future and I may need to go back to the urls to re-tag them.

I have the list of predefined tags/keywords in a table and I'm now trying to work out what the structure of the resultant table should be in order for it to be easy to update/add/remove the tags for any given url but also to be able to search for a url based on a keyword.

So far I have
tags table
ID | NAME
1 | medical
2 | karate
3 | Technical
4 | Javascript
5 | Shopping

For the resultant table, my initial structure:
URL & TAG_ID are a UNIQUE combination
ID | URL | TAG_ID
1 | www.e-e.com | 4
2 | www.e-e.com | 3
3 | www.google.com | 1
4 | www.google.com | 2
5 | www.amazon.com | 5

So you can see there's multiple entries for one url so that there's a one-to-one match with a tag.  I'm not sure if this is the best approach?

BTW The urls and tags given above are for example only and haven't determined the ones I'll use yet.  I'm more interested in the structure I would use.

If the above table structure is sound, then how do I go about adding/removing tags?  The server will receive a set of tags from the front end... Can MySQL do a diff and only add/remove those records that do / do not match?


I gather for searching it's just a matter of a query/view that joins both tables where a keyword search can be done?

Your help is greatly appreciated :)
LVL 43
RobOwner (Aidellio)Asked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

slightwv (䄆 Netminder) Commented:
While I ponder how I would attempt to solve this, something to think about is homographs:
https://en.wikipedia.org/wiki/Homograph

In the Oracle Text document that use the example crane (bird) and crane (machine).

Do you need to account for this in your tag table is a tag word is "crane"?
RobOwner (Aidellio)Author Commented:
Not concerned with double meanings on the tags.  for instance, I would have them as "crane (bird)" and "crane (machine)" as separate tags
gr8gonzoConsultantCommented:
If you think you'll be limiting the number of tags to just a handful, it might be better to NOT use a normalized structure and simply have a lengthy text field that contains a delimited set of tags, like:

Site | Tags
www.e-e.com | technical JavaScript

It would likely be faster, which is good for tag searches.

If you anticipate a LOT of tags, or want to do lots of tag analysis, then the normalized structure is better but in that case, you'd probably want a separate table for the domains and then just have a mapping table to link the domains to the tags.

Or if you have plenty of space, do both so you have fast searching on the denormalized table and then analysis on the normalized tables.

To answer your other question, no,MySQL doesn't have a diff or merge function like that - you'd have to just do it in code but that's not difficult.
Big Business Goals? Which KPIs Will Help You

The most successful MSPs rely on metrics – known as key performance indicators (KPIs) – for making informed decisions that help their businesses thrive, rather than just survive. This eBook provides an overview of the most important KPIs used by top MSPs.

Tomas Helgi JohannssonCommented:
Hi,

I would go with this structure which you posted.
As gr8gonzo mentions MySQL doesn't have the MERGE / DIFF functions (yet) however you could create a stored procedure that take URL and tag values which you have previously loaded into temp table and issue delete, INSERT INTO ... ON DUPLICATE ... and/or REPLACE commands
to update your tables.
Having proper indexes on the table will speed up the queries and updates you run on those tables.
One good reason for this structure is that you can either issue

select t.name tag_name,group_concat(DISTINCT u.url ORDER BY u.url) urls
from tagtable t, urltable u
where t.id = u.tag_id
group by t.name

Open in new window

or
select u.url,group_concat(DISTINCT t.name ORDER BY t.name) tag_name
from tagtable t, urltable u
where t.id = u.tag_id
group by u.url

Open in new window


The first query groups all urls into a list that belongs to a tag while the second query returns list of tags belonging to an url.
Such queries would be complex and heavy if the tables are not normalized.
Also having the tags as a concated list in the URL table would make updates quite complex and heavy if you want to remove/update certain tags for all associated urls.

https://dev.mysql.com/doc/refman/8.0/en/insert.html
https://dev.mysql.com/doc/refman/8.0/en/replace.html
http://www.mysqltutorial.org/mysql-stored-procedure-tutorial.aspx

Regards,
    Tomas Helgi
Scott PletcherSenior DBACommented:
You should also "normalize out" the URL to another table.  That is, also use a number to encode the URL.  That's at least as important for the URLs as for the tags.

Most important of all, since URL and tag are a unique combination, you do not need an ID column on that table.  Of course you can add one if you just can't accept a table without an identity, but even then, DO NOT MAKE ID THE CLUSTERING KEY!  You have a natural clustering key already that will better support your searches.

URLs : key = URL_id
tags : key = tag_id
URL_tags : key = ( tag_id, URL_id ) (if you'll mostly search by tag first, which it sounds like you'll do).  Also, as new tags are added, they will be kept together as well.  Or key by ( URL_id, tag_id ) if somehow you won't search mostly by tag_id?
slightwv (䄆 Netminder) Commented:
I also vote for the normalized approach and based on what you've described, the two tables you have should work.

I also agree with removing URL repeating values.  Store once, store repeating, preferably numeric, keys.

I'm wondering about the removal requirement.  If you receive a list of tags from a front end, what tags in the table are you wanting to remove or update?  Add I assume is "If not there then insert"?
RobOwner (Aidellio)Author Commented:
Fantastic info. Thank you everyone.

I see how important that is to move the URLs out too.

Almost done here I think. The complexity remaining may just be stored by the stored procedure? Easy when a tag doesn't exist for a url, just insert..  it's when the tags change and I need to remove one. I guess what I'm after is avoiding doing this line by line (which may not be possible).
Is it silly / inefficient that when tags come back to just delete all the entries for that url and add them back again given all the tags for a given url are always passed back and forth?
slightwv (䄆 Netminder) Commented:
>>Is it silly / inefficient that when tags come back to just delete all the entries for that url and add them back again given all the tags for a given url are always passed back and forth?

That is what I was thinking.  Each new submission is a complete overlay.

I'm thinking this should be easy enough with small amounts of tags per document even in a LARGE amount of overall tags.  Indexing should make it pretty quick but your mileage might vary.

Even with Gigs of data on hardware capable of handling that mount of data, deleting 100 or so rows to turn right back around and reinsert them shouldn't be a huge lift.

Maybe not "ideal" but without some mechanism to also pass in some sort of "action" flag for Add, Update, Delete with the tag, not sure how you would handle the update piece.  Even on the deletes without the flag, the database would still have to access the blocks for that doc that aren't in the new list to delete.  Might as well just delete/replace.

Now if you can track changes on the client side and pass in a list like:  A:Karate, D:Javascript, C:Shopping/Buying, then it might make sense to do targeted DML.  You would need to see if that would be more efficient but I still betting a full, replace everything might still be faster.   Definitely less code to write/debug/maintain.
gr8gonzoConsultantCommented:
>>Is it silly / inefficient that when tags come back to just delete all the entries for that url and add them back again given all the tags for a given url are always passed back and forth?

I wouldn't recommend this. I tell my clients that they should minimize deleting and updating data from tables whenever possible.

1. The main issue is that updating and deleting can result in index locks during the operations, so if you make it part of the normal operation, it can lead to poor long-term performance as your tables grow and become more active. This is an overly simplistic, but a safer recommendation. If you want to get into the details, you should read the MySQL documentation about row and table locking and how your engine choice (MyISAM, InnoDB, etc) impacts those concepts (although I usually recommend InnoDB unless you really know for sure what you're doing - it tends to have less "unwanted" locking).

The layman's explanation is to imagine going into a library and looking for a book through one of those card catalogs. Let's say the library is occasionally adding and removing books from the shelves. When they add a book, they don't need to interrupt your search - they just drop a new card into the front of the card catalog drawer. But if they REMOVE or REPLACE a book, then the librarian takes the entire card catalog away for a moment, searches for the cards to remove/replace, and then makes the change, and then puts it back. And since maintaining an accurate catalog is more important than your book search (especially if you're looking for a book that JUST got removed), that operation takes precedence. And you just have to wait until that card catalog is updated before you can do your search.

Now imagine that when the library was small, it was common practice that whenever some new books came in, the librarian did this whole update thing where they remove 50 existing cards, then re-add 46 desired ones. Since the library was small, you didn't have many people using the card catalog, so a little inefficiency wasn't that big of a deal. But then the library grew and grew and grew, and now you've suddenly got a line of people out the door, all waiting for the librarian to finish updating the card catalog. To make matters worse, the bigger the library, the more activity there is, so this whole updating debacle is done even more frequently than before. That should give you an idea of what's going on here, and the long-term impact of it.

2. There are going to be times where an "update" doesn't really update anything. You might remove 15 tags and re-add the same 15 tags. When data changes within a table, you're also invalidating the query cache for that table, which can negatively impact your DB performance.

3. A good developer rule of thumb is to treat the DB as your most precious resource. Web server resources are cheap and easy to load balance, so you can be inefficient in code and waste a few CPU cycles here and there, but you should try to only interact with the database when absolutely necessary so that you're preserving its resources for other connections.

So yes, deleting and re-adding everything might simplify your code a tiny bit, but it's really not a good practice, and it's pretty easy to determine differences in code. Pseudo-code looks something like this:

// Get the IDs of the existing tags for a URL
$oldTagIDs = query("SELECT tag_id FROM tags2urls WHERE url_id=12345");  // foo = 1, helloworld = 2, crane = 4

// Get the IDs for the desired tags
$newTagNames = array("foo", "bar", "bird");
$newTagIDs = query("SELECT tag_id FROM tags WHERE name IN ('foo','bar'); // foo = 1, bar = 3, bird = 5

// Determine what to add or remove.
$addTags = array_diff($newTagIDs , $oldTagIDs ); // bar = 3, bird = 5
$removeTags = array_diff($oldTagIDs , $newTagIDs ); // helloworld = 2, crane = 4

if(count($removeTags))
{
  query("DELETE FROM tags2urls WHERE url_id=12345 AND tag_id IN (2, 4)"); // Delete tags 2 and 4
}
if(count($addTags))
{
  query("INSERT INTO tags2urls (url_id, tag_id) VALUES (12345, 3), (12345, 5)"); // Add tags 3 and 5 for URL 12345
}

Open in new window

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
slightwv (䄆 Netminder) Commented:
My MySQL is only high-level.  I've installed it and played with it but never done deep performance tuning on it.

Yes, I tend to assume InnoDB when I think MySQL.  That is a dangerous assumption...  but unless you absolutely have to, why use anything else?

Scanned:
https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html

which got me to:
https://dev.mysql.com/doc/refman/8.0/en/innodb-locking.html#innodb-next-key-locks

Guess I can see it as something to consider.  Not sure I would worry that much about it unless every transaction per second counts.

Of course if the coded version takes 1 second and the possible full lock version takes 10 seconds, SURE, WORRY.  If the lock version takes 1.5 or 2 seconds, does it matter?  Only Rob can answer that one...

>>Pseudo-code looks something like this:

Wouldn't you want those initial selects as a locking read if there is a possibility of two transactions trying to change the tags for a specific URL?
Scott PletcherSenior DBACommented:
If you're clustering on ( tag_id, URL_id) (or vice versa), then DELETE and re-INSERT.  SQL will do that with the rows anyway when you do an UPDATE.

I'm assuming the typical data pattern here: that SELECTs will vastly outnumber INSERTs/DELETEs.  If so, it's still often best to cluster for lookup, and accept a bit more overhead of the DELETE+INSERT.
gr8gonzoConsultantCommented:
Of course if the coded version takes 1 second and the possible full lock version takes 10 seconds
Yes, it's less of an issue now. When I get brought in to consult on performance problems with databases, it's almost always the same story - "Nobody ever thought it would get this big."

Granted, not every project is popular enough to grow like that, but I think it's generally better to eliminate as much DB inefficiency as soon as it's feasible and cost-effective. For example, implementing load-balanced DB servers for a project that only has 5 users is not cost-effective. But when it's something that's easily accomplished and can make a big difference in the long term, why not do it?

One of my clients had an issue where they had a process that NORMALLY took about 200-300 milliseconds total. It did some queuing and automatic cleanup, which involved frequent deletes and they kept running into issues with timeouts (they hosted a web service component that had a required time SLA of a response within 10 seconds). It was because of the "automatic cleanup" that the original developer had implemented a long time ago. Sure, it kept one particular table nice and neat, and nobody raised the alarms early on when they first started seeing some random timeouts (once every few thousand transactions) because they figured it was just network glitches. They brought me in because it was getting to the point of happening multiple times an hour. When they had any spikes in volume (all the time), it was queuing up all of these queries that were deleting from that table, so one of mission-critical components of a very well-known, international company was being bottlenecked by one of the most irrelevant and unimportant tables ever. Moral of the story - the biggest problems can start with the smallest oversights.

Wouldn't you want those initial selects as a locking read
Or wrapped in a transaction or something.
slightwv (䄆 Netminder) Commented:
I will easily bow to your actual MySQL expertise over mine!!!  I'm n00t status...

>>Moral of the story - the biggest problems can start with the smallest oversights.

Agreed.  In this instance the potential bottleneck has been documented?

I see your point and can concede it depending on the actual requirements and time to implement.

I also know that those "I'll get to it later" tasks are rarely ever addressed.
RobOwner (Aidellio)Author Commented:
haha you feel like a n00t! LOL

@gr8gonzo thanks for the code - I see how the DELETE and INSERT works there

@scott
If you're clustering on ( tag_id, URL_id) (or vice versa), then DELETE and re-INSERT.  SQL will do that with the rows anyway when you do an UPDATE.
that clustering definitely looks the way to go and the SELECT/DELETE/INSERT wrapped in a locking transaction

much appreciated everyone
RobOwner (Aidellio)Author Commented:
Thank you. Fantastic answers and helpful
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Query Syntax

From novice to tech pro — start learning today.