
To what extent would it be possible to harvest data (data mining/web scraping/data harvesting) from selected websites and have the collected data sorted with a minimum of manual labour?

The purpose is to collect historical data on harness racing horses. What type of software would be useful, and which is the best for this purpose?

The data consists of a horse name (which can vary in length, for example between 1 and 5 words) and comments about that horse.

The websites are, of course, structured very differently. Here are some examples:

http://norrtrav.se/spela-pa-hastar/spela-pa-hastar-3/ (comments about the horses are in the PDFs)
http://www.travronden.se/intervjuer (a searchable database where you can enter the horse name, for example "Danne Edel", to view all historical interviews that have been conducted for that horse)

Besides these selected websites, I would also like to search for everything available on the internet regarding interviews with the trainer/driver, for around 150 horses at a time. So I wonder whether there is any alternative to just using Google and searching for "Danne Edel".

The idea is that by collecting historical interviews with the trainers/drivers, I will be able to build a complete profile of each horse. I would like to store this in a database with a minimum of manual labour, so that I end up with a huge database in which all historical interviews with the trainers/drivers are stored under each horse name.

A third method would be to search the internet with this combination (using the horse name "Danne Edel" as an example):

"Danne Edel"
loppkommentarer OR resultatkommentarer

(loppkommentarer: comments about a race; resultatkommentarer: comments about a race result)
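For around 150 horses, the search strings themselves can be generated rather than typed. The following is a minimal sketch, assuming the search engine accepts quoted phrases and the OR operator; the function name `build_query` and the short horse list are illustrative only, and the sketch does not submit the queries anywhere.

```python
def build_query(horse_name, keywords=("loppkommentarer", "resultatkommentarer")):
    """Return a search string: the quoted horse name plus OR-joined keywords."""
    phrase = '"' + horse_name.strip() + '"'
    return phrase + " " + " OR ".join(keywords)


# Illustrative list; in practice this would hold the ~150 horse names.
horses = ["Danne Edel", "Jeppe Yamoz"]

queries = [build_query(name) for name in horses]
for query in queries:
    print(query)
```

Each generated string can then be pasted into a search engine or passed to whatever search tool is chosen.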

A fourth method would be to search my own downloaded PDFs (a large number), such as the one uploaded here. For example, I could search for the horse name "Jeppe Yamoz", which you can see in the PDF, and the "loppkommentarer" (comments about the race), written in bold characters beneath the horse name, should be extracted to this horse's database. There are 5 such comments for each horse, written in bold characters. For example, the first comment begins with "Ner inv e 150" (these are not complete sentences and are often written with abbreviations). After these 5 comments comes the next horse name with its 5 comments.
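The layout described above (one horse name followed by five comments) can be grouped mechanically once the PDF text has been extracted. This is only a sketch under strong assumptions: the text is already available as plain lines (PDF extraction itself needs a separate tool), bold formatting is lost, and every block is exactly one name line plus five comment lines. The function name `group_comments` and all sample lines except "Jeppe Yamoz" and "Ner inv e 150" are placeholders.

```python
COMMENTS_PER_HORSE = 5  # from the description: five race comments per horse


def group_comments(lines, per_horse=COMMENTS_PER_HORSE):
    """Group non-empty lines into {horse_name: [comments]}.

    Assumes each block is one horse-name line followed by exactly
    `per_horse` comment lines; real PDF text will likely need more cleanup.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    block = per_horse + 1
    profiles = {}
    for start in range(0, len(cleaned), block):
        chunk = cleaned[start:start + block]
        name, comments = chunk[0], chunk[1:]
        profiles.setdefault(name, []).extend(comments)
    return profiles


# Placeholder text standing in for extracted PDF content.
sample = """Jeppe Yamoz
Ner inv e 150
placeholder comment 2
placeholder comment 3
placeholder comment 4
placeholder comment 5
Danne Edel
placeholder comment 1
placeholder comment 2
placeholder comment 3
placeholder comment 4
placeholder comment 5
"""

profiles = group_comments(sample.splitlines())
print(profiles["Jeppe Yamoz"][0])
```

If the real PDFs do not follow the fixed six-line pattern, the name lines would have to be recognised some other way, for example by matching against a known list of horse names.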

To conclude, I would like to use data mining/web scraping/data harvesting to automatically build a huge database of horse names that contains historical interviews with trainers/drivers as well as comments about historical trotting races. The idea is to be able to get a complete profile of each horse.
v86-loppkommentarer.pdf

ASKER CERTIFIED SOLUTION

arnold


SOLUTION

Steve Bink


hermesalpha

ASKER

Thanks for your different angles on this. I will begin by trying to pull data from some websites and see if I can get it structured right first.

Pratul Sricastava

Using a web scraper you can extract data from multiple websites to a single spreadsheet (or database) so that it becomes easy for you to analyze (or even visualize) the data.
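As a rough illustration of the "single database" idea, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the function names (`open_db`, `add_comment`, `profile`) and the sample row are assumptions made for this example, not a recommended schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS comments (
               horse  TEXT NOT NULL,
               source TEXT NOT NULL,
               text   TEXT NOT NULL,
               UNIQUE (horse, source, text)
           )"""
    )
    return conn


def add_comment(conn, horse, source, text):
    """Store one comment; an identical row is ignored on re-import."""
    conn.execute(
        "INSERT OR IGNORE INTO comments (horse, source, text) VALUES (?, ?, ?)",
        (horse, source, text),
    )
    conn.commit()


def profile(conn, horse):
    """Return every stored (source, text) pair for one horse."""
    rows = conn.execute(
        "SELECT source, text FROM comments WHERE horse = ? ORDER BY rowid",
        (horse,),
    )
    return rows.fetchall()


conn = open_db()
add_comment(conn, "Jeppe Yamoz", "v86-loppkommentarer.pdf", "Ner inv e 150")
add_comment(conn, "Jeppe Yamoz", "v86-loppkommentarer.pdf", "Ner inv e 150")  # duplicate
print(profile(conn, "Jeppe Yamoz"))
```

The `UNIQUE` constraint combined with `INSERT OR IGNORE` means the same source can be scraped repeatedly without piling up duplicate rows.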

Data on the Internet is usually viewable only in a web browser and has little or no structure. Most websites give users no way to save a copy of the data they display, so the only option is manual copy and paste. Capturing and separating exactly the data you want by hand is time-consuming and tedious. Fortunately, web scraping can automate the process and organize the data in minutes, instead of you copying it from websites manually.
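To make the "automatic" part concrete, the sketch below pulls text out of a page with Python's standard `html.parser` module. It assumes the wanted text sits in elements with a known CSS class; the class name `interview` and the sample HTML are invented for illustration, and each real site would need its own selector.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the collected pieces and normalise whitespace.
            self.results.append(" ".join("".join(self.buffer).split()))
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Invented markup standing in for a downloaded page.
html = '<div class="interview"><b>Danne Edel</b> placeholder interview text</div><p>unrelated</p>'

extractor = ClassTextExtractor("interview")
extractor.feed(html)
print(extractor.results)
```

Downloading the page is a separate step, and dedicated scraping libraries handle malformed HTML more robustly than this sketch does.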

Nowadays, web scraping is widely used in various fields, such as news portals, blogs, forums, e-commerce websites, social media, real estate, and financial reports. Its purposes are also varied, including contact scraping, online price comparison, website change detection, web data integration, weather data monitoring, and research.