how do I follow these instructions to convert the XML wiktionary database to MYSQL?

Posted on 2008-10-02
Last Modified: 2010-04-21
I am trying to follow this tutorial to convert wiktionary to a mysql database:

I downloaded mwdumper.jar and the relevant wiktionary dump from (specifically, pages-articles.xml.bz2)

I then created the mysql database 'wiktionary' and created the tables using tables.sql linked to in the tutorial, and ran the suggested command from the terminal, getting this result

java -jar /Applications/mwdumper.jar --format=sql:1.5 /Applications/enwiktionary-20080613-pages-articles.xml.bz2 | mysql -u root -p wiktionary
-bash: mysql: command not found
1,000 pages (687.285/sec), 1,000 revs (687.285/sec)
2,000 pages (531.915/sec), 2,000 revs (531.915/sec)
3,000 pages (613.874/sec), 3,000 revs (613.874/sec)
4,000 pages (760.456/sec), 4,000 revs (760.456/sec)
[this goes on for about 920,000 pages]

This doesn't seem to have actually had any result, since I don't see any data in my wiktionary database. Presumably this has something to do with the "mysql: command not found" error. (I get the same result running as root.)

how can I fix this problem? for what it's worth, I am running mysql on OS X, installed via XAMPP (, and my knowledge of linux is paltry.
Question by:bitt3n
  • 4
  • 3
LVL 23

Accepted Solution

Mysidia earned 500 total points
ID: 22630663
-bash: mysql: command not found

That's a problem.   You need to have your mysql command line client working correctly.
If you can't type
mysql -u root -p wiktionary

at a shell prompt and have a 'mysql>'  prompt,  then that command won't work.

You need to either type the full path to the mysql client binary
or edit your PATH  environment variable to include the path to the 'mysql' program.


Author Comment

ID: 22630984
ok changing the path made something happen, since I'm now seeing data in my database, just not the data I was expecting.

what I was expecting was tables including a table of words and wordids and a table of definitions with corresponding wordids, so that I could call up definitions for words the way wiktionary itself does on pages like this:

what I'm getting is just three tables, text, revision, and page, and the only thing that looks like a list of words is the 'page_title' field in the page table, with no definitions in any of the tables. From it looks like I am downloading the right XML file for wiktionary, (specifically, pages-articles.xml.bz2), so I am confused why I don't see the data.

Is it possible that I am somehow still screwing up the XML to MySQL conversion, or am I being an idiot in some entirely unrelated way?

Author Comment

ID: 22630993
note if my problem is now with finding the right database rather than importing it, I'm happy to open a new question for this, given that if that is true, the XML conversion is working fine
Enterprise Mobility and BYOD For Dummies

Like “For Dummies” books, you can read this in whatever order you choose and learn about mobility and BYOD; and how to put a competitive mobile infrastructure in place. Developed for SMBs and large enterprises alike, you will find helpful use cases, planning, and implementation.

LVL 23

Expert Comment

ID: 22638742
Examine a few rows of the  'tables', 'text' and 'page' tables.

I believe you will find page information in the 'pages' table
and actual document texts in the 'text' table.

The thing to keep in mind, is you are importing a MediaWiki  WIKI Database.

It is much more complicated than a simple database of words and definitions;
words are defined within the body of  "pages"  in the Wiktionary wiki.

All text contents should be in the 'text' table, if you imported properly.
Including old revisions that aren't active definitions.

The 'pages'  table if imported properly,  should be the table of pages, and
there would be an integer indexing the proper entry in the 'text' table for the
current revision of the page.

Other 'texts'  that link to that page being historical revisions stored in the database.


Author Comment

ID: 22873428
sorry for the delayed response. It looks like you were correct, and the page_title and page_latest columns of the page table correspond to the word and the word_id, while the BLOB value in old_text and the id of old_id in the text table corresponds to the page content and the matching word_id. (I'm not yet sure what page_id in the page table is used for, since it doesn't correspond to old_id.)

so it appears I can now look up a word's page, but the page data looks like the below (for the word "pound"). It seems like this data is in a format which will be difficult to manipulate by, for example, associating each meaning with a meaning_id. Would it be correct to say that this database is useful for serving web pages, but not for actually manipulating the data they contain? What is the format the data is in now called?

Here's the beginning of the 'definition' (page) for the word "pound":

content is ==English== ===Pronunciation=== * {{IPA|/paŠnd/}} * {{audio|en-us-pound.ogg|Audio (US)}} * {{rhymes|aŠnd}} ===Etymology 1=== {{OE.}} ''[[pund]]'', from {{Ger.}} ''*punda-'', an early borrowing from {{L.}} ''[[pondo|pondM]]'' in ''[[libra|l+bra]] [[pondo|pondM]]'' a pound by weight (a misinterpretation: ''[[pondo|pondM]]'' is the adverb meaning by weight, originally an ablative form of ''[[pondus]]''). Cognate with Dutch ''[[pond]]'', German ''[[Pfund]]'', Swedish ''[[pund]]''. ====Noun==== {{en-noun}} # A [[unit]] of [[weight]]: of 16 [[ounce]]s in the [[avoirdupois]] system (= 453.592 g) or of 12 [[troy ounce]]s in the [[troy weight|troy system]] (= 373.242 g). Its symbol is [[lb]]. # [[pound-force|Pound-force]]. # {{US}} The symbol '''#''' # A unit of [[currency]] used in Cyprus; Egypt; Lebanon; the United Kingdom and its [[dependency|dependencies]]; and formerly in the Republic of Ireland (which now uses the [[euro]]) and Israel (which now uses the [[sheqel]]). Its symbol is [[£]]. =====Translations===== {{trans-top|unit of mass}} * Arabic: {{Arab|[[('HF/]]}} {{IPAchar|(baund)}} {{m}} * [[Basque]]: [[libra]] * Bosnian: [[funta]] {{f}} * Bulgarian: [[DC=B]] {{m}} * [[Catalan]]: [[lliura]] {{f}} * Chinese: [[Å]] (bàng) * Czech: [[libra]] {{f}} * Danish: [[pund]] {{n}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[naula]], [[pauna]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯²Á±]] {{f}} * Hungarian: [[font#Hungarian|font]] * Irish: [[punt]] {{m}} * Italian: [[libbra]] {{f}} {{trans-mid}} * Japanese: [[ÝóÉ]] (pondo) * Korean: [[ ´Ü]] (paundeu) * Lithuanian: [[svaras]] * [[Luxembourgish]]: [[Pond]] * Maltese: [[libra]] {{f}} * Norwegian: [[pund]] {{n}} * Polish: [[funt]] {{m}} * Portuguese: [[libra]] {{f}} * Russian: [[DC=B]] (funt) {{m}} * Slovak: [[libra]] {{f}} * Slovene: [[funt]] {{m}} * Spanish: [[libra]] {{f}} * Swahili: [[ratili]] ''(noun 9/10)'' * Swedish: [[pund]] {{n}} * Yiddish: [[äÕàØ]] (punt) {{n}} {{trans-bottom}} {{trans-top|US: symbol #}} * Finnish: [[ristikkomerkki]], [[risuaita]] (''colloquial'') * Spanish: [[almohadilla]] {{f}} {{italbrac|Spain}}, [[gato]] {{m}} {{italbrac|Chile|Mexico}}, [[libra]] {{f}} {{italbrac|Puerto Rico}}, [[numeral]] {{m}} {{italbrac|Latin America}}, [[número]] {{m}} {{italbrac|Latin America}} {{trans-mid}} {{trans-bottom}} {{trans-top|unit of currency}} * Czech: [[libra]] {{f}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[punta]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯Á±]] {{f}} {{trans-mid}} * Irish: [[punt]] {{m}} * Japanese: [[ÝóÉ]] (pondo) * Maltese: [[lira]] {{f}} * Spanish: [[libra]] {{f}} * Swedish: [[pund]] {{n}} {{trans-bottom}} =====Synonyms===== * (''unit of currency''):

LVL 23

Expert Comment

ID: 22910817
yes...   the data in the table contains the text of the page in Wikicode and is truly meant to be processed by a MediaWiki  server  to render a web page.

It is this way by the very nature Wiktionary is built  (contributors edit wiki pages not 'definition lists',  and the pages they edit are stored).

To use it in another manner, you will need to decide how you want to try to parse the text,  and it is an extremely difficult task and could take a skilled programmer a long time to perfect a parser for extracting only the definition  but all the definition.

There may be ways to extract definitions from the text which for some words incldues many different sections formatted different ways;  to the extent that there is consistency in the way Wiktionary  definitions have been formatted.

Actually, pages on Wiktionary can be long, and there aren't formatting conventions precise enough that you can easily scrape  a simple  one-paragraph, plaintext definition.

Any parser one can easily write would be heuristic in nature,  and  would fail for some words in the Wiktionary  where the wiki page is formatted differently enough to be an exception that confuses parsing rules that work for other pages.


Author Closing Comment

ID: 31502631

Featured Post

Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
Introduction This article is intended for those who are new to PHP error handling (  It addresses one of the most common problems that plague beginning PHP develop…
Learn several ways to interact with files and get file information from the bash shell. ls lists the contents of a directory: Using the -a flag displays hidden files: Using the -l flag formats the output in a long list: The file command gives us mor…
Learn how to get help with Linux/Unix bash shell commands. Use help to read help documents for built in bash shell commands.: Use man to interface with the online reference manuals for shell commands.: Use man to search man pages for unknown command…

910 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now