asked on

Storing and Retrieving Matrices from a Database

How best do I store and retrieve matrices from a database? Taking MySQL, for example, do I simply use 3 tables: the document, the attributes and the document_attributes, or is that too simplistic and poor for performance. Do I not use mysql and something of the NoSQL sort like MongoDB?

Ray Paseur

Please post a sample of the data that constitutes your matrix (or matrices). Your suggestion about three tables sounds intuitively right (assuming document_attributes is a junction table) but it might be easier to understand the question if we can see the data. Thanks, ~Ray

aristanoble

ASKER

Well, the articles table, for example, would contain a list of articles: article-1, article-2, ...article-n.
The attributes table would contain a list of attributes: attribute-1, attribute-2, ... attribute-n.
And the document_attributes junction table would contain a list of "junctions":
article-1_id:attribute-1_id, article-1_id:attribute-2_id, article-1_id:attribute-3_id
article-2_id:attribute-1_id
article-3_id:attribute-3_id,article-3_id:attribute-4_id;

Ray Paseur

That sounds correct to me. What kind of data is this in the article and attribute tables? Please post the CREATE TABLE statements and some example data for these tables, thanks.

Data base performance is usually most affected by the size of the server memory, the indexing strategy and the complexity of the queries. A query that joins two tables via a junction table is not one that might be called complex.

aristanoble

ASKER

Thanks Ray for your help. I haven't created the tables because I was first trying to figure out the best way to store this data or even if I should be using MySQL. So I'm still in the planning stages. The reason why I ask about how to store the data is because when it comes to retrieving the data to produce an incidence matrix, for example. Take for example the comparison of n documents and n terms:

d1 d2 d3 ... dn
t1 0 1 0 1
t2 1 1 0 0
t3 1 1 1 0
...
tn 0 0 1 1

Where d = document, t = term, and 0/1 denotes the presence or absence of the term.
How would I select the data to be able to produce such a matrix?

Ray Paseur

How would I select the data...?

The answer will be dependent on the data, which is why I want to see some test data so I understand what we're dealing with!

If I have some spare time this morning I may try to create a little test data set so I can show you how I would generate this matrix. It's a fairly well understood design pattern in computer science. Intersections generate the "1" values.

ASKER CERTIFIED SOLUTION

Ray Paseur

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

aristanoble

ASKER

I understand the solution and it's very insightful, thank you. My biggest concern would be beginning at line 259. Iterating over n molecules and connecting to mysql inside that iteration could be costly time wise and resource wise. Would there be a better way with or without MySQL (i.e. using another method/specialized software)?

Ray Paseur

... connecting to mysql inside that iteration could be costly time wise ...

"Connecting" is a term of art in data base work, and what you're actually looking at there is a query, not a connection. Unless you have tested it and timed it you could not know how long that query takes, but I would like to suggest that it is representative of the relational nature of relational data bases. It gives us the relationships of the alphabet keys to the molecules keys. If these columns are indexed, it is essentially an in-memory operation and will complete in lightning fast times. Having had some experience with these kinds of relational tables, I would not waste time looking for a better way to do it.