Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win


how do I follow these instructions to convert the XML wiktionary database to MYSQL?

Posted on 2008-10-02
Medium Priority
Last Modified: 2010-04-21
I am trying to follow this tutorial to convert wiktionary to a mysql database:


I downloaded mwdumper.jar and the relevant wiktionary dump from http://download.wikimedia.org/enwiktionary/20080613/ (specifically, pages-articles.xml.bz2)

I then created the mysql database 'wiktionary' and created the tables using tables.sql linked to in the tutorial, and ran the suggested command from the terminal, getting this result

java -jar /Applications/mwdumper.jar --format=sql:1.5 /Applications/enwiktionary-20080613-pages-articles.xml.bz2 | mysql -u root -p wiktionary
-bash: mysql: command not found
1,000 pages (687.285/sec), 1,000 revs (687.285/sec)
2,000 pages (531.915/sec), 2,000 revs (531.915/sec)
3,000 pages (613.874/sec), 3,000 revs (613.874/sec)
4,000 pages (760.456/sec), 4,000 revs (760.456/sec)
[this goes on for about 920,000 pages]

This doesn't seem to have actually had any result, since I don't see any data in my wiktionary database. Presumably this has something to do with the "mysql: command not found" error. (I get the same result running as root.)

how can I fix this problem? for what it's worth, I am running mysql on OS X, installed via XAMPP (xampp.org), and my knowledge of linux is paltry.
Question by:bitt3n
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
LVL 23

Accepted Solution

Mysidia earned 2000 total points
ID: 22630663
-bash: mysql: command not found

That's a problem.   You need to have your mysql command line client working correctly.
If you can't type
mysql -u root -p wiktionary

at a shell prompt and have a 'mysql>'  prompt,  then that command won't work.

You need to either type the full path to the mysql client binary
or edit your PATH  environment variable to include the path to the 'mysql' program.


Author Comment

ID: 22630984
ok changing the path made something happen, since I'm now seeing data in my database, just not the data I was expecting.

what I was expecting was tables including a table of words and wordids and a table of definitions with corresponding wordids, so that I could call up definitions for words the way wiktionary itself does on pages like this:


what I'm getting is just three tables, text, revision, and page, and the only thing that looks like a list of words is the 'page_title' field in the page table, with no definitions in any of the tables. From http://download.wikimedia.org/enwiktionary/20080613/ it looks like I am downloading the right XML file for wiktionary, (specifically, pages-articles.xml.bz2), so I am confused why I don't see the data.

Is it possible that I am somehow still screwing up the XML to MySQL conversion, or am I being an idiot in some entirely unrelated way?

Author Comment

ID: 22630993
note if my problem is now with finding the right database rather than importing it, I'm happy to open a new question for this, given that if that is true, the XML conversion is working fine
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

LVL 23

Expert Comment

ID: 22638742
Examine a few rows of the  'tables', 'text' and 'page' tables.

I believe you will find page information in the 'pages' table
and actual document texts in the 'text' table.

The thing to keep in mind, is you are importing a MediaWiki  WIKI Database.

It is much more complicated than a simple database of words and definitions;
words are defined within the body of  "pages"  in the Wiktionary wiki.

All text contents should be in the 'text' table, if you imported properly.
Including old revisions that aren't active definitions.

The 'pages'  table if imported properly,  should be the table of pages, and
there would be an integer indexing the proper entry in the 'text' table for the
current revision of the page.

Other 'texts'  that link to that page being historical revisions stored in the database.


Author Comment

ID: 22873428
sorry for the delayed response. It looks like you were correct, and the page_title and page_latest columns of the page table correspond to the word and the word_id, while the BLOB value in old_text and the id of old_id in the text table corresponds to the page content and the matching word_id. (I'm not yet sure what page_id in the page table is used for, since it doesn't correspond to old_id.)

so it appears I can now look up a word's page, but the page data looks like the below (for the word "pound"). It seems like this data is in a format which will be difficult to manipulate by, for example, associating each meaning with a meaning_id. Would it be correct to say that this database is useful for serving web pages, but not for actually manipulating the data they contain? What is the format the data is in now called?

Here's the beginning of the 'definition' (page) for the word "pound":

content is ==English== ===Pronunciation=== * {{IPA|/paŠnd/}} * {{audio|en-us-pound.ogg|Audio (US)}} * {{rhymes|aŠnd}} ===Etymology 1=== {{OE.}} ''[[pund]]'', from {{Ger.}} ''*punda-'', an early borrowing from {{L.}} ''[[pondo|pondM]]'' in ''[[libra|l+bra]] [[pondo|pondM]]'' a pound by weight (a misinterpretation: ''[[pondo|pondM]]'' is the adverb meaning by weight, originally an ablative form of ''[[pondus]]''). Cognate with Dutch ''[[pond]]'', German ''[[Pfund]]'', Swedish ''[[pund]]''. ====Noun==== {{en-noun}} # A [[unit]] of [[weight]]: of 16 [[ounce]]s in the [[avoirdupois]] system (= 453.592 g) or of 12 [[troy ounce]]s in the [[troy weight|troy system]] (= 373.242 g). Its symbol is [[lb]]. # [[pound-force|Pound-force]]. # {{US}} The symbol '''#''' # A unit of [[currency]] used in Cyprus; Egypt; Lebanon; the United Kingdom and its [[dependency|dependencies]]; and formerly in the Republic of Ireland (which now uses the [[euro]]) and Israel (which now uses the [[sheqel]]). Its symbol is [[£]]. =====Translations===== {{trans-top|unit of mass}} * Arabic: {{Arab|[[('HF/]]}} {{IPAchar|(baund)}} {{m}} * [[Basque]]: [[libra]] * Bosnian: [[funta]] {{f}} * Bulgarian: [[DC=B]] {{m}} * [[Catalan]]: [[lliura]] {{f}} * Chinese: [[Å]] (bàng) * Czech: [[libra]] {{f}} * Danish: [[pund]] {{n}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[naula]], [[pauna]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯²Á±]] {{f}} * Hungarian: [[font#Hungarian|font]] * Irish: [[punt]] {{m}} * Italian: [[libbra]] {{f}} {{trans-mid}} * Japanese: [[ÝóÉ]] (pondo) * Korean: [[ ´Ü]] (paundeu) * Lithuanian: [[svaras]] * [[Luxembourgish]]: [[Pond]] * Maltese: [[libra]] {{f}} * Norwegian: [[pund]] {{n}} * Polish: [[funt]] {{m}} * Portuguese: [[libra]] {{f}} * Russian: [[DC=B]] (funt) {{m}} * Slovak: [[libra]] {{f}} * Slovene: [[funt]] {{m}} * Spanish: [[libra]] {{f}} * Swahili: [[ratili]] ''(noun 9/10)'' * Swedish: [[pund]] {{n}} * Yiddish: [[äÕàØ]] (punt) {{n}} {{trans-bottom}} {{trans-top|US: symbol #}} * Finnish: [[ristikkomerkki]], [[risuaita]] (''colloquial'') * Spanish: [[almohadilla]] {{f}} {{italbrac|Spain}}, [[gato]] {{m}} {{italbrac|Chile|Mexico}}, [[libra]] {{f}} {{italbrac|Puerto Rico}}, [[numeral]] {{m}} {{italbrac|Latin America}}, [[número]] {{m}} {{italbrac|Latin America}} {{trans-mid}} {{trans-bottom}} {{trans-top|unit of currency}} * Czech: [[libra]] {{f}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[punta]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯Á±]] {{f}} {{trans-mid}} * Irish: [[punt]] {{m}} * Japanese: [[ÝóÉ]] (pondo) * Maltese: [[lira]] {{f}} * Spanish: [[libra]] {{f}} * Swedish: [[pund]] {{n}} {{trans-bottom}} =====Synonyms===== * (''unit of currency''):

LVL 23

Expert Comment

ID: 22910817
yes...   the data in the table contains the text of the page in Wikicode and is truly meant to be processed by a MediaWiki  server  to render a web page.

It is this way by the very nature Wiktionary is built  (contributors edit wiki pages not 'definition lists',  and the pages they edit are stored).

To use it in another manner, you will need to decide how you want to try to parse the text,  and it is an extremely difficult task and could take a skilled programmer a long time to perfect a parser for extracting only the definition  but all the definition.

There may be ways to extract definitions from the text which for some words incldues many different sections formatted different ways;  to the extent that there is consistency in the way Wiktionary  definitions have been formatted.

Actually, pages on Wiktionary can be long, and there aren't formatting conventions precise enough that you can easily scrape  a simple  one-paragraph, plaintext definition.

Any parser one can easily write would be heuristic in nature,  and  would fail for some words in the Wiktionary  where the wiki page is formatted differently enough to be an exception that confuses parsing rules that work for other pages.


Author Closing Comment

ID: 31502631

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This post looks at MongoDB and MySQL, and covers high-level MongoDB strengths, weaknesses, features, and uses from the perspective of an SQL user.
In this blog post, we’ll look at how using thread_statistics can cause high memory usage.
Learn how to find files with the shell using the find and locate commands. Use locate to find a needle in a haystack.: With locate, check if the file still exists.: Use find to get the actual location of the file.:
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
Suggested Courses

604 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question