Script to Parse Data from Wikipedia

Discussion in 'Computer Questions, Issues & Security' started by b4rbz, Feb 27, 2014.

  1. b4rbz

    New Member

    Joined:
    Feb 27, 2014
    Messages:
    1
    Likes Received:
    0
    Hey guys, first post! I want to make a website, but the site I have in mind requires a lot of information and data to be added to a database. How would I go about parsing and storing data from a Wikipedia page? I figured a Python script of some sort would do, but I wouldn't know where to begin.
     
  2. azraf

    Member

    Joined:
    Mar 2, 2009
    Messages:
    78
    Likes Received:
    3
    For parsing data, there are lots of tools available. You can check Python's "Scrapy" framework, PHP's "cURL" library, Node.js's "cheerio" or "jsdom" modules, or even "PhantomJS" for JS-rendered pages. There are plenty more options available as well.

    Just ask Google and it will give you the details.
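    To give you an idea, here's a minimal Scrapy spider sketch. The example URL and the CSS selectors (Wikipedia's "firstHeading" title and "mw-parser-output" content div) are my assumptions, so check them against the pages you actually want:

        import scrapy

        class WikiSpider(scrapy.Spider):
            name = "wiki"
            # Example article; replace with the pages you actually need
            start_urls = ["https://en.wikipedia.org/wiki/Web_scraping"]

            def parse(self, response):
                # Pull the article title and body paragraphs out of the page
                yield {
                    "title": response.css("h1#firstHeading ::text").get(),
                    "paragraphs": response.css("div.mw-parser-output > p ::text").getall(),
                }

    Save that as wiki_spider.py and run it with "scrapy runspider wiki_spider.py -o articles.json" to dump the scraped items to a JSON file.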
     
  3. bjdea2

    Member

    Joined:
    Mar 16, 2013
    Messages:
    159
    Likes Received:
    5
    That's a big project!

    You'd need to look for common keywords/signposts in each Wikipedia page, i.e. study the page structure and find the common markers that can serve as start and end points for data extraction. For example, look for the part of each page that gives the title, the content, the links, and so on. First of all, you'll need to work out exactly what it is you want to extract from each page.

    Note that this all assumes you know how to retrieve the pages for parsing in your script, with cURL or something similar - there's a rough sketch of the whole extraction step below.
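    For example, here's a rough Python sketch of that extraction step using requests and BeautifulSoup in place of cURL. The example URL, the "firstHeading" title id and the "mw-parser-output" content class are assumptions about Wikipedia's current markup, so verify them on the pages you care about:

        import requests
        from bs4 import BeautifulSoup

        # Fetch the page (this takes the place of cURL)
        url = "https://en.wikipedia.org/wiki/Web_scraping"
        html = requests.get(url, headers={"User-Agent": "my-wiki-bot/0.1"}).text
        soup = BeautifulSoup(html, "html.parser")

        # Title: Wikipedia renders the article title in an h1 with id "firstHeading"
        title = soup.find("h1", id="firstHeading").get_text(strip=True)

        # Content: body paragraphs sit inside the main "mw-parser-output" div
        body = soup.find("div", class_="mw-parser-output")
        paragraphs = [p.get_text(" ", strip=True) for p in body.find_all("p")]

        # Links: internal article links all start with /wiki/
        links = [a["href"] for a in body.find_all("a", href=True)
                 if a["href"].startswith("/wiki/")]

        print(title, len(paragraphs), "paragraphs,", len(links), "links")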

    You will then need to enter the extracted info into your own custom database or into a MySQL database.
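    As a rough idea of that last step, here's a sketch using Python's built-in sqlite3 module as a stand-in for MySQL (with MySQL you'd do the same thing through a connector library such as mysql-connector-python, using %s placeholders instead of ?). The table name and columns are just an assumption:

        import sqlite3

        # Values produced by the parsing step above
        title = "Web scraping"
        content = "First paragraph...\n\nSecond paragraph..."

        conn = sqlite3.connect("wiki_data.db")
        cur = conn.cursor()
        cur.execute("""
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT,
                content TEXT
            )
        """)
        # Parameterised insert so the page text can't break the SQL
        cur.execute("INSERT INTO articles (title, content) VALUES (?, ?)", (title, content))
        conn.commit()
        conn.close()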

    This can all be done, but it would require a lot of time and work.
     
