Solved

how do I follow these instructions to convert the XML wiktionary database to MYSQL?

Posted on 2008-10-02
9
1,283 Views
Last Modified: 2010-04-21
I am trying to follow this tutorial to convert wiktionary to a mysql database:

http://alexpb.com/notes/articles/2007/09/25/importing-wikimedia-dump-file-to-mysql/

I downloaded mwdumper.jar and the relevant wiktionary dump from http://download.wikimedia.org/enwiktionary/20080613/ (specifically, pages-articles.xml.bz2)

I then created the mysql database 'wiktionary' and created the tables using tables.sql linked to in the tutorial, and ran the suggested command from the terminal, getting this result

java -jar /Applications/mwdumper.jar --format=sql:1.5 /Applications/enwiktionary-20080613-pages-articles.xml.bz2 | mysql -u root -p wiktionary
-bash: mysql: command not found
1,000 pages (687.285/sec), 1,000 revs (687.285/sec)
2,000 pages (531.915/sec), 2,000 revs (531.915/sec)
3,000 pages (613.874/sec), 3,000 revs (613.874/sec)
4,000 pages (760.456/sec), 4,000 revs (760.456/sec)
[this goes on for about 920,000 pages]

This doesn't seem to have actually had any result, since I don't see any data in my wiktionary database. Presumably this has something to do with the "mysql: command not found" error. (I get the same result running as root.)

how can I fix this problem? for what it's worth, I am running mysql on OS X, installed via XAMPP (xampp.org), and my knowledge of linux is paltry.
0
Comment
Question by:bitt3n
  • 4
  • 3
9 Comments
 
LVL 23

Accepted Solution

by:
Mysidia earned 500 total points
ID: 22630663
-bash: mysql: command not found
^^^^^^^^^^^^^^^^^^^^^^^^

That's a problem.   You need to have your mysql command line client working correctly.
If you can't type
mysql -u root -p wiktionary

at a shell prompt and have a 'mysql>'  prompt,  then that command won't work.

You need to either type the full path to the mysql client binary
or edit your PATH  environment variable to include the path to the 'mysql' program.




0
 

Author Comment

by:bitt3n
ID: 22630984
ok changing the path made something happen, since I'm now seeing data in my database, just not the data I was expecting.

what I was expecting was tables including a table of words and wordids and a table of definitions with corresponding wordids, so that I could call up definitions for words the way wiktionary itself does on pages like this:

http://en.wiktionary.org/wiki/spring

what I'm getting is just three tables, text, revision, and page, and the only thing that looks like a list of words is the 'page_title' field in the page table, with no definitions in any of the tables. From http://download.wikimedia.org/enwiktionary/20080613/ it looks like I am downloading the right XML file for wiktionary, (specifically, pages-articles.xml.bz2), so I am confused why I don't see the data.

Is it possible that I am somehow still screwing up the XML to MySQL conversion, or am I being an idiot in some entirely unrelated way?
0
 

Author Comment

by:bitt3n
ID: 22630993
note if my problem is now with finding the right database rather than importing it, I'm happy to open a new question for this, given that if that is true, the XML conversion is working fine
0
What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

 
LVL 23

Expert Comment

by:Mysidia
ID: 22638742
Examine a few rows of the  'tables', 'text' and 'page' tables.

I believe you will find page information in the 'pages' table
and actual document texts in the 'text' table.

The thing to keep in mind, is you are importing a MediaWiki  WIKI Database.


It is much more complicated than a simple database of words and definitions;
words are defined within the body of  "pages"  in the Wiktionary wiki.

All text contents should be in the 'text' table, if you imported properly.
Including old revisions that aren't active definitions.

The 'pages'  table if imported properly,  should be the table of pages, and
there would be an integer indexing the proper entry in the 'text' table for the
current revision of the page.

Other 'texts'  that link to that page being historical revisions stored in the database.











0
 

Author Comment

by:bitt3n
ID: 22873428
sorry for the delayed response. It looks like you were correct, and the page_title and page_latest columns of the page table correspond to the word and the word_id, while the BLOB value in old_text and the id of old_id in the text table corresponds to the page content and the matching word_id. (I'm not yet sure what page_id in the page table is used for, since it doesn't correspond to old_id.)

so it appears I can now look up a word's page, but the page data looks like the below (for the word "pound"). It seems like this data is in a format which will be difficult to manipulate by, for example, associating each meaning with a meaning_id. Would it be correct to say that this database is useful for serving web pages, but not for actually manipulating the data they contain? What is the format the data is in now called?

Here's the beginning of the 'definition' (page) for the word "pound":

content is ==English== ===Pronunciation=== * {{IPA|/paŠnd/}} * {{audio|en-us-pound.ogg|Audio (US)}} * {{rhymes|aŠnd}} ===Etymology 1=== {{OE.}} ''[[pund]]'', from {{Ger.}} ''*punda-'', an early borrowing from {{L.}} ''[[pondo|pondM]]'' in ''[[libra|l+bra]] [[pondo|pondM]]'' a pound by weight (a misinterpretation: ''[[pondo|pondM]]'' is the adverb meaning by weight, originally an ablative form of ''[[pondus]]''). Cognate with Dutch ''[[pond]]'', German ''[[Pfund]]'', Swedish ''[[pund]]''. ====Noun==== {{en-noun}} # A [[unit]] of [[weight]]: of 16 [[ounce]]s in the [[avoirdupois]] system (= 453.592 g) or of 12 [[troy ounce]]s in the [[troy weight|troy system]] (= 373.242 g). Its symbol is [[lb]]. # [[pound-force|Pound-force]]. # {{US}} The symbol '''#''' # A unit of [[currency]] used in Cyprus; Egypt; Lebanon; the United Kingdom and its [[dependency|dependencies]]; and formerly in the Republic of Ireland (which now uses the [[euro]]) and Israel (which now uses the [[sheqel]]). Its symbol is [[£]]. =====Translations===== {{trans-top|unit of mass}} * Arabic: {{Arab|[[('HF/]]}} {{IPAchar|(baund)}} {{m}} * [[Basque]]: [[libra]] * Bosnian: [[funta]] {{f}} * Bulgarian: [[DC=B]] {{m}} * [[Catalan]]: [[lliura]] {{f}} * Chinese: [[Å]] (bàng) * Czech: [[libra]] {{f}} * Danish: [[pund]] {{n}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[naula]], [[pauna]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯²Á±]] {{f}} * Hungarian: [[font#Hungarian|font]] * Irish: [[punt]] {{m}} * Italian: [[libbra]] {{f}} {{trans-mid}} * Japanese: [[ÝóÉ]] (pondo) * Korean: [[ ´Ü]] (paundeu) * Lithuanian: [[svaras]] * [[Luxembourgish]]: [[Pond]] * Maltese: [[libra]] {{f}} * Norwegian: [[pund]] {{n}} * Polish: [[funt]] {{m}} * Portuguese: [[libra]] {{f}} * Russian: [[DC=B]] (funt) {{m}} * Slovak: [[libra]] {{f}} * Slovene: [[funt]] {{m}} * Spanish: [[libra]] {{f}} * Swahili: [[ratili]] ''(noun 9/10)'' * Swedish: [[pund]] {{n}} * Yiddish: [[äÕàØ]] (punt) {{n}} {{trans-bottom}} {{trans-top|US: symbol #}} * Finnish: [[ristikkomerkki]], [[risuaita]] (''colloquial'') * Spanish: [[almohadilla]] {{f}} {{italbrac|Spain}}, [[gato]] {{m}} {{italbrac|Chile|Mexico}}, [[libra]] {{f}} {{italbrac|Puerto Rico}}, [[numeral]] {{m}} {{italbrac|Latin America}}, [[número]] {{m}} {{italbrac|Latin America}} {{trans-mid}} {{trans-bottom}} {{trans-top|unit of currency}} * Czech: [[libra]] {{f}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[punta]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯Á±]] {{f}} {{trans-mid}} * Irish: [[punt]] {{m}} * Japanese: [[ÝóÉ]] (pondo) * Maltese: [[lira]] {{f}} * Spanish: [[libra]] {{f}} * Swedish: [[pund]] {{n}} {{trans-bottom}} =====Synonyms===== * (''unit of currency''):

0
 
LVL 23

Expert Comment

by:Mysidia
ID: 22910817
yes...   the data in the table contains the text of the page in Wikicode and is truly meant to be processed by a MediaWiki  server  to render a web page.

It is this way by the very nature Wiktionary is built  (contributors edit wiki pages not 'definition lists',  and the pages they edit are stored).

To use it in another manner, you will need to decide how you want to try to parse the text,  and it is an extremely difficult task and could take a skilled programmer a long time to perfect a parser for extracting only the definition  but all the definition.

There may be ways to extract definitions from the text which for some words incldues many different sections formatted different ways;  to the extent that there is consistency in the way Wiktionary  definitions have been formatted.

Actually, pages on Wiktionary can be long, and there aren't formatting conventions precise enough that you can easily scrape  a simple  one-paragraph, plaintext definition.

Any parser one can easily write would be heuristic in nature,  and  would fail for some words in the Wiktionary  where the wiki page is formatted differently enough to be an exception that confuses parsing rules that work for other pages.

0
 

Author Closing Comment

by:bitt3n
ID: 31502631
thanks!!
0

Featured Post

Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

Join & Write a Comment

Foreword In the years since this article was written, numerous hacking attacks have targeted password-protected web sites.  The storage of client passwords has become a subject of much discussion, some of it useful and some of it misguided.  Of cou…
The purpose of this article is to demonstrate how we can use conditional statements using Python.
Connecting to an Amazon Linux EC2 Instance from Windows Using PuTTY.
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now