how do I follow these instructions to convert the XML wiktionary database to MYSQL?

Posted on 2008-10-02
Last Modified: 2010-04-21
I am trying to follow this tutorial to convert wiktionary to a mysql database:

I downloaded mwdumper.jar and the relevant wiktionary dump from (specifically, pages-articles.xml.bz2)

I then created the mysql database 'wiktionary' and created the tables using tables.sql linked to in the tutorial, and ran the suggested command from the terminal, getting this result

java -jar /Applications/mwdumper.jar --format=sql:1.5 /Applications/enwiktionary-20080613-pages-articles.xml.bz2 | mysql -u root -p wiktionary
-bash: mysql: command not found
1,000 pages (687.285/sec), 1,000 revs (687.285/sec)
2,000 pages (531.915/sec), 2,000 revs (531.915/sec)
3,000 pages (613.874/sec), 3,000 revs (613.874/sec)
4,000 pages (760.456/sec), 4,000 revs (760.456/sec)
[this goes on for about 920,000 pages]

This doesn't seem to have actually had any result, since I don't see any data in my wiktionary database. Presumably this has something to do with the "mysql: command not found" error. (I get the same result running as root.)

how can I fix this problem? for what it's worth, I am running mysql on OS X, installed via XAMPP (, and my knowledge of linux is paltry.
Question by:bitt3n
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
LVL 23

Accepted Solution

Mysidia earned 500 total points
ID: 22630663
-bash: mysql: command not found

That's a problem.   You need to have your mysql command line client working correctly.
If you can't type
mysql -u root -p wiktionary

at a shell prompt and have a 'mysql>'  prompt,  then that command won't work.

You need to either type the full path to the mysql client binary
or edit your PATH  environment variable to include the path to the 'mysql' program.


Author Comment

ID: 22630984
ok changing the path made something happen, since I'm now seeing data in my database, just not the data I was expecting.

what I was expecting was tables including a table of words and wordids and a table of definitions with corresponding wordids, so that I could call up definitions for words the way wiktionary itself does on pages like this:

what I'm getting is just three tables, text, revision, and page, and the only thing that looks like a list of words is the 'page_title' field in the page table, with no definitions in any of the tables. From it looks like I am downloading the right XML file for wiktionary, (specifically, pages-articles.xml.bz2), so I am confused why I don't see the data.

Is it possible that I am somehow still screwing up the XML to MySQL conversion, or am I being an idiot in some entirely unrelated way?

Author Comment

ID: 22630993
note if my problem is now with finding the right database rather than importing it, I'm happy to open a new question for this, given that if that is true, the XML conversion is working fine
What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

LVL 23

Expert Comment

ID: 22638742
Examine a few rows of the  'tables', 'text' and 'page' tables.

I believe you will find page information in the 'pages' table
and actual document texts in the 'text' table.

The thing to keep in mind, is you are importing a MediaWiki  WIKI Database.

It is much more complicated than a simple database of words and definitions;
words are defined within the body of  "pages"  in the Wiktionary wiki.

All text contents should be in the 'text' table, if you imported properly.
Including old revisions that aren't active definitions.

The 'pages'  table if imported properly,  should be the table of pages, and
there would be an integer indexing the proper entry in the 'text' table for the
current revision of the page.

Other 'texts'  that link to that page being historical revisions stored in the database.


Author Comment

ID: 22873428
sorry for the delayed response. It looks like you were correct, and the page_title and page_latest columns of the page table correspond to the word and the word_id, while the BLOB value in old_text and the id of old_id in the text table corresponds to the page content and the matching word_id. (I'm not yet sure what page_id in the page table is used for, since it doesn't correspond to old_id.)

so it appears I can now look up a word's page, but the page data looks like the below (for the word "pound"). It seems like this data is in a format which will be difficult to manipulate by, for example, associating each meaning with a meaning_id. Would it be correct to say that this database is useful for serving web pages, but not for actually manipulating the data they contain? What is the format the data is in now called?

Here's the beginning of the 'definition' (page) for the word "pound":

content is ==English== ===Pronunciation=== * {{IPA|/paŠnd/}} * {{audio|en-us-pound.ogg|Audio (US)}} * {{rhymes|aŠnd}} ===Etymology 1=== {{OE.}} ''[[pund]]'', from {{Ger.}} ''*punda-'', an early borrowing from {{L.}} ''[[pondo|pondM]]'' in ''[[libra|l+bra]] [[pondo|pondM]]'' a pound by weight (a misinterpretation: ''[[pondo|pondM]]'' is the adverb meaning by weight, originally an ablative form of ''[[pondus]]''). Cognate with Dutch ''[[pond]]'', German ''[[Pfund]]'', Swedish ''[[pund]]''. ====Noun==== {{en-noun}} # A [[unit]] of [[weight]]: of 16 [[ounce]]s in the [[avoirdupois]] system (= 453.592 g) or of 12 [[troy ounce]]s in the [[troy weight|troy system]] (= 373.242 g). Its symbol is [[lb]]. # [[pound-force|Pound-force]]. # {{US}} The symbol '''#''' # A unit of [[currency]] used in Cyprus; Egypt; Lebanon; the United Kingdom and its [[dependency|dependencies]]; and formerly in the Republic of Ireland (which now uses the [[euro]]) and Israel (which now uses the [[sheqel]]). Its symbol is [[£]]. =====Translations===== {{trans-top|unit of mass}} * Arabic: {{Arab|[[('HF/]]}} {{IPAchar|(baund)}} {{m}} * [[Basque]]: [[libra]] * Bosnian: [[funta]] {{f}} * Bulgarian: [[DC=B]] {{m}} * [[Catalan]]: [[lliura]] {{f}} * Chinese: [[Å]] (bàng) * Czech: [[libra]] {{f}} * Danish: [[pund]] {{n}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[naula]], [[pauna]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯²Á±]] {{f}} * Hungarian: [[font#Hungarian|font]] * Irish: [[punt]] {{m}} * Italian: [[libbra]] {{f}} {{trans-mid}} * Japanese: [[ÝóÉ]] (pondo) * Korean: [[ ´Ü]] (paundeu) * Lithuanian: [[svaras]] * [[Luxembourgish]]: [[Pond]] * Maltese: [[libra]] {{f}} * Norwegian: [[pund]] {{n}} * Polish: [[funt]] {{m}} * Portuguese: [[libra]] {{f}} * Russian: [[DC=B]] (funt) {{m}} * Slovak: [[libra]] {{f}} * Slovene: [[funt]] {{m}} * Spanish: [[libra]] {{f}} * Swahili: [[ratili]] ''(noun 9/10)'' * Swedish: [[pund]] {{n}} * Yiddish: [[äÕàØ]] (punt) {{n}} {{trans-bottom}} {{trans-top|US: symbol #}} * Finnish: [[ristikkomerkki]], [[risuaita]] (''colloquial'') * Spanish: [[almohadilla]] {{f}} {{italbrac|Spain}}, [[gato]] {{m}} {{italbrac|Chile|Mexico}}, [[libra]] {{f}} {{italbrac|Puerto Rico}}, [[numeral]] {{m}} {{italbrac|Latin America}}, [[número]] {{m}} {{italbrac|Latin America}} {{trans-mid}} {{trans-bottom}} {{trans-top|unit of currency}} * Czech: [[libra]] {{f}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[punta]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯Á±]] {{f}} {{trans-mid}} * Irish: [[punt]] {{m}} * Japanese: [[ÝóÉ]] (pondo) * Maltese: [[lira]] {{f}} * Spanish: [[libra]] {{f}} * Swedish: [[pund]] {{n}} {{trans-bottom}} =====Synonyms===== * (''unit of currency''):

LVL 23

Expert Comment

ID: 22910817
yes...   the data in the table contains the text of the page in Wikicode and is truly meant to be processed by a MediaWiki  server  to render a web page.

It is this way by the very nature Wiktionary is built  (contributors edit wiki pages not 'definition lists',  and the pages they edit are stored).

To use it in another manner, you will need to decide how you want to try to parse the text,  and it is an extremely difficult task and could take a skilled programmer a long time to perfect a parser for extracting only the definition  but all the definition.

There may be ways to extract definitions from the text which for some words incldues many different sections formatted different ways;  to the extent that there is consistency in the way Wiktionary  definitions have been formatted.

Actually, pages on Wiktionary can be long, and there aren't formatting conventions precise enough that you can easily scrape  a simple  one-paragraph, plaintext definition.

Any parser one can easily write would be heuristic in nature,  and  would fail for some words in the Wiktionary  where the wiki page is formatted differently enough to be an exception that confuses parsing rules that work for other pages.


Author Closing Comment

ID: 31502631

Featured Post

Back Up Your Microsoft Windows Server®

Back up all your Microsoft Windows Server – on-premises, in remote locations, in private and hybrid clouds. Your entire Windows Server will be backed up in one easy step with patented, block-level disk imaging. We achieve RTOs (recovery time objectives) as low as 15 seconds.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Linux Mailserver setup & redundancy 2 82
Zimbra on Amazon Linux help 7 114
reverse engineer .sql from php files 11 53
Linux kernel panic ext3-fs error 14 33
It’s 2016. Password authentication should be dead — or at least close to dying. But, unfortunately, it has not traversed Quagga stage yet. Using password authentication is like laundering hotel guest linens with a washboard — it’s Passé.
Creating and Managing Databases with phpMyAdmin in cPanel.
Connecting to an Amazon Linux EC2 Instance from Windows Using PuTTY.
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.

733 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question