how do I follow these instructions to convert the XML wiktionary database to MYSQL?

Posted on 2008-10-02
Medium Priority
Last Modified: 2010-04-21
I am trying to follow this tutorial to convert wiktionary to a mysql database:


I downloaded mwdumper.jar and the relevant wiktionary dump from http://download.wikimedia.org/enwiktionary/20080613/ (specifically, pages-articles.xml.bz2)

I then created the mysql database 'wiktionary' and created the tables using tables.sql linked to in the tutorial, and ran the suggested command from the terminal, getting this result

java -jar /Applications/mwdumper.jar --format=sql:1.5 /Applications/enwiktionary-20080613-pages-articles.xml.bz2 | mysql -u root -p wiktionary
-bash: mysql: command not found
1,000 pages (687.285/sec), 1,000 revs (687.285/sec)
2,000 pages (531.915/sec), 2,000 revs (531.915/sec)
3,000 pages (613.874/sec), 3,000 revs (613.874/sec)
4,000 pages (760.456/sec), 4,000 revs (760.456/sec)
[this goes on for about 920,000 pages]

This doesn't seem to have actually had any result, since I don't see any data in my wiktionary database. Presumably this has something to do with the "mysql: command not found" error. (I get the same result running as root.)

how can I fix this problem? for what it's worth, I am running mysql on OS X, installed via XAMPP (xampp.org), and my knowledge of linux is paltry.
Question by:bitt3n
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
LVL 23

Accepted Solution

Mysidia earned 2000 total points
ID: 22630663
-bash: mysql: command not found

That's a problem.   You need to have your mysql command line client working correctly.
If you can't type
mysql -u root -p wiktionary

at a shell prompt and have a 'mysql>'  prompt,  then that command won't work.

You need to either type the full path to the mysql client binary
or edit your PATH  environment variable to include the path to the 'mysql' program.


Author Comment

ID: 22630984
ok changing the path made something happen, since I'm now seeing data in my database, just not the data I was expecting.

what I was expecting was tables including a table of words and wordids and a table of definitions with corresponding wordids, so that I could call up definitions for words the way wiktionary itself does on pages like this:


what I'm getting is just three tables, text, revision, and page, and the only thing that looks like a list of words is the 'page_title' field in the page table, with no definitions in any of the tables. From http://download.wikimedia.org/enwiktionary/20080613/ it looks like I am downloading the right XML file for wiktionary, (specifically, pages-articles.xml.bz2), so I am confused why I don't see the data.

Is it possible that I am somehow still screwing up the XML to MySQL conversion, or am I being an idiot in some entirely unrelated way?

Author Comment

ID: 22630993
note if my problem is now with finding the right database rather than importing it, I'm happy to open a new question for this, given that if that is true, the XML conversion is working fine
Ransomware-A Revenue Bonanza for Service Providers

Ransomware – malware that gets on your customers’ computers, encrypts their data, and extorts a hefty ransom for the decryption keys – is a surging new threat.  The purpose of this eBook is to educate the reader about ransomware attacks.

LVL 23

Expert Comment

ID: 22638742
Examine a few rows of the  'tables', 'text' and 'page' tables.

I believe you will find page information in the 'pages' table
and actual document texts in the 'text' table.

The thing to keep in mind, is you are importing a MediaWiki  WIKI Database.

It is much more complicated than a simple database of words and definitions;
words are defined within the body of  "pages"  in the Wiktionary wiki.

All text contents should be in the 'text' table, if you imported properly.
Including old revisions that aren't active definitions.

The 'pages'  table if imported properly,  should be the table of pages, and
there would be an integer indexing the proper entry in the 'text' table for the
current revision of the page.

Other 'texts'  that link to that page being historical revisions stored in the database.


Author Comment

ID: 22873428
sorry for the delayed response. It looks like you were correct, and the page_title and page_latest columns of the page table correspond to the word and the word_id, while the BLOB value in old_text and the id of old_id in the text table corresponds to the page content and the matching word_id. (I'm not yet sure what page_id in the page table is used for, since it doesn't correspond to old_id.)

so it appears I can now look up a word's page, but the page data looks like the below (for the word "pound"). It seems like this data is in a format which will be difficult to manipulate by, for example, associating each meaning with a meaning_id. Would it be correct to say that this database is useful for serving web pages, but not for actually manipulating the data they contain? What is the format the data is in now called?

Here's the beginning of the 'definition' (page) for the word "pound":

content is ==English== ===Pronunciation=== * {{IPA|/paŠnd/}} * {{audio|en-us-pound.ogg|Audio (US)}} * {{rhymes|aŠnd}} ===Etymology 1=== {{OE.}} ''[[pund]]'', from {{Ger.}} ''*punda-'', an early borrowing from {{L.}} ''[[pondo|pondM]]'' in ''[[libra|l+bra]] [[pondo|pondM]]'' a pound by weight (a misinterpretation: ''[[pondo|pondM]]'' is the adverb meaning by weight, originally an ablative form of ''[[pondus]]''). Cognate with Dutch ''[[pond]]'', German ''[[Pfund]]'', Swedish ''[[pund]]''. ====Noun==== {{en-noun}} # A [[unit]] of [[weight]]: of 16 [[ounce]]s in the [[avoirdupois]] system (= 453.592 g) or of 12 [[troy ounce]]s in the [[troy weight|troy system]] (= 373.242 g). Its symbol is [[lb]]. # [[pound-force|Pound-force]]. # {{US}} The symbol '''#''' # A unit of [[currency]] used in Cyprus; Egypt; Lebanon; the United Kingdom and its [[dependency|dependencies]]; and formerly in the Republic of Ireland (which now uses the [[euro]]) and Israel (which now uses the [[sheqel]]). Its symbol is [[£]]. =====Translations===== {{trans-top|unit of mass}} * Arabic: {{Arab|[[('HF/]]}} {{IPAchar|(baund)}} {{m}} * [[Basque]]: [[libra]] * Bosnian: [[funta]] {{f}} * Bulgarian: [[DC=B]] {{m}} * [[Catalan]]: [[lliura]] {{f}} * Chinese: [[Å]] (bàng) * Czech: [[libra]] {{f}} * Danish: [[pund]] {{n}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[naula]], [[pauna]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯²Á±]] {{f}} * Hungarian: [[font#Hungarian|font]] * Irish: [[punt]] {{m}} * Italian: [[libbra]] {{f}} {{trans-mid}} * Japanese: [[ÝóÉ]] (pondo) * Korean: [[ ´Ü]] (paundeu) * Lithuanian: [[svaras]] * [[Luxembourgish]]: [[Pond]] * Maltese: [[libra]] {{f}} * Norwegian: [[pund]] {{n}} * Polish: [[funt]] {{m}} * Portuguese: [[libra]] {{f}} * Russian: [[DC=B]] (funt) {{m}} * Slovak: [[libra]] {{f}} * Slovene: [[funt]] {{m}} * Spanish: [[libra]] {{f}} * Swahili: [[ratili]] ''(noun 9/10)'' * Swedish: [[pund]] {{n}} * Yiddish: [[äÕàØ]] (punt) {{n}} {{trans-bottom}} {{trans-top|US: symbol #}} * Finnish: [[ristikkomerkki]], [[risuaita]] (''colloquial'') * Spanish: [[almohadilla]] {{f}} {{italbrac|Spain}}, [[gato]] {{m}} {{italbrac|Chile|Mexico}}, [[libra]] {{f}} {{italbrac|Puerto Rico}}, [[numeral]] {{m}} {{italbrac|Latin America}}, [[número]] {{m}} {{italbrac|Latin America}} {{trans-mid}} {{trans-bottom}} {{trans-top|unit of currency}} * Czech: [[libra]] {{f}} * Dutch: [[pond#Dutch|pond]] {{n}} * Finnish: [[punta]] * French: [[livre]] {{f}} * German: [[Pfund]] {{n}} * Greek: [[»¯Á±]] {{f}} {{trans-mid}} * Irish: [[punt]] {{m}} * Japanese: [[ÝóÉ]] (pondo) * Maltese: [[lira]] {{f}} * Spanish: [[libra]] {{f}} * Swedish: [[pund]] {{n}} {{trans-bottom}} =====Synonyms===== * (''unit of currency''):

LVL 23

Expert Comment

ID: 22910817
yes...   the data in the table contains the text of the page in Wikicode and is truly meant to be processed by a MediaWiki  server  to render a web page.

It is this way by the very nature Wiktionary is built  (contributors edit wiki pages not 'definition lists',  and the pages they edit are stored).

To use it in another manner, you will need to decide how you want to try to parse the text,  and it is an extremely difficult task and could take a skilled programmer a long time to perfect a parser for extracting only the definition  but all the definition.

There may be ways to extract definitions from the text which for some words incldues many different sections formatted different ways;  to the extent that there is consistency in the way Wiktionary  definitions have been formatted.

Actually, pages on Wiktionary can be long, and there aren't formatting conventions precise enough that you can easily scrape  a simple  one-paragraph, plaintext definition.

Any parser one can easily write would be heuristic in nature,  and  would fail for some words in the Wiktionary  where the wiki page is formatted differently enough to be an exception that confuses parsing rules that work for other pages.


Author Closing Comment

ID: 31502631

Featured Post

Get MongoDB database support online, now!

At Percona’s web store you can order your MongoDB database support needs in minutes. No hassles, no fuss, just pick and click. Pay online with a credit card. Handle your MongoDB database support now!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Join Greg Farro and Ethan Banks from Packet Pushers (http://packetpushers.net/podcast/podcasts/pq-show-93-smart-network-monitoring-paessler-sponsored/) and Greg Ross from Paessler (https://www.paessler.com/prtg) for a discussion about smart network …
In part one, we reviewed the prerequisites required for installing SQL Server vNext. In this part we will explore how to install Microsoft's SQL Server on Ubuntu 16.04.
Learn several ways to interact with files and get file information from the bash shell. ls lists the contents of a directory: Using the -a flag displays hidden files: Using the -l flag formats the output in a long list: The file command gives us mor…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
Suggested Courses

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question