Research:Data

From Meta, a Wikimedia project coordination wiki

There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available.

If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you still have questions, you can email your question to the analytics@lists.wikimedia.org mailing list.

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.


If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.

See also inspirational example uses.

Also consider searching for datasets at Zenodo, Figshare, Dimensions.ai or Google Dataset Search.

Quick glance

Data Dumps (details)

Homepage | Download

Dumps of all WMF projects for backup, offline use, research, etc.

APIs (details)

  • The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases over the web.
    • Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
    • Output in JSON, WDDX, XML, YAML, or PHP's native serialization format

Database access (Toolforge, PAWS, Quarry) (details)

The Toolforge hosting environment allows you to connect to shared server resources and query copies of the Wikimedia projects' content databases.

  • acts as a standard web server hosting web-based tools
  • command-line tools
  • account required

PAWS is a Jupyter Notebook environment within Toolforge that lets you, for example, query database replicas and APIs for analysis.

Quarry is a public web interface allowing SQL queries to database replicas.

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki through the EventStreams service.

Analytics Dumps (details)

Homepage

Raw pageviews, unique device estimates, mediacounts, etc.

WikiStats (details)

Homepage

Reports in 25+ languages based on data dumps and server log files.

  • Unique visits, page views, active editors and more
  • Intermediate CSV files available.
  • Graphical presentation.
  • Monthly

DBpedia (details)

Homepage

DBpedia extracts structured data from Wikipedia. It allows users to run complex queries and link Wikipedia data to other data sets.

  • RDF, N-Triples, SPARQL endpoint, Linked Data
  • billions of triples of info in a consistent Ontology

DataHub and Figshare (details)

DataHub Homepage

A collection of various Wikimedia-related datasets.

Figshare (datasets tagged 'wikipedia')

Data dumps

Home page

Data dumps


Description

WMF releases data dumps of Wikipedia and all WMF projects on a regular basis, as well as dumps of other Wikimedia-related data such as search indices and short URL mappings.

Content

XML/SQL dumps

  • Text of current and/or all revisions of all pages, in XML format (schema)
  • Metadata for current and/or all revisions of all pages, in XML format (schema)
  • Most database tables as SQL files
    • Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
    • Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
    • Media metadata (image, oldimage tables)
    • Info about each page (page, page_props, page_restrictions tables)
    • Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
    • List of all pages that are redirects and their targets (redirect table)
    • Log data, including blocks, protection, deletion, uploads (logging table)
    • Misc bits (interwiki, site_stats, user_groups tables)
  • Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content

Other dumps

  • Static HTML dumps for 2007-2008 [1]
  • Adds/changes dumps (no moves or deletes, plus some other limitations) [2] (docs)
  • Wikidata entity dumps [3]
  • Full list of what's available: [4]

(see more)

Downloading

You can download the latest dumps (covering the last year) from dumps.wikimedia.org (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.).

Archives: dumps.wikimedia.org/archive/

Current mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.

Many older dumps can be found at the Internet Archive.

Data format

XML dumps since 2010 are in the wrapper format described at Export format (schema). Files are compressed in gzip (.gz), bzip2/lbzip2 (.bz2) and .7z formats.

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.
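Because the XML dumps are large and compressed, it is usually more practical to stream them than to load them into memory. The following Python sketch shows one way to do that with the standard library. The function names (local_name, iter_pages), the tiny in-memory sample, and the export-0.10 namespace in it are assumptions made for illustration; a real dump would be opened by file name and may use a different schema version, which is why the code ignores the namespace.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in an XML dump stream.

    Only the first <text> element seen in each page is kept, which is
    enough for dumps that contain current revisions only.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text" and text is None:
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the memory held by the finished page


# Hypothetical two-page sample standing in for a real .xml.bz2 dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello, world.</text></revision>
  </page>
  <page>
    <title>Second page</title>
    <revision><text>More text.</text></revision>
  </page>
</mediawiki>"""

compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed, "rb") as dump:
    pages = list(iter_pages(dump))

print(pages)
```

For a downloaded dump, the same function can be given the result of bz2.open() called with the path to the .bz2 file.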

How to and examples

See the step-by-step examples of importing dumps into a MySQL database.

Existing tools

Available tools are listed in the following locations, but information is not always up-to-date:

Access

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support

Research projects using data from this source


MediaWiki API

Description

The web service API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

Content

Endpoint

To query the database, you send an HTTP GET request to the desired endpoint (for example, http://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to "query" and defining the query details in the URL.

How to and examples

Here's a simple example:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Main%20Page

This means: fetch (action=query) the content (rvprop=content) of the most recent revision of Main Page (titles=Main%20Page) of English Wikipedia (http://en.wikipedia.org/w/api.php?) in XML format (format=xml). You can paste the URL into a browser to see the output.
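The same request can be assembled in a script. The Python sketch below rebuilds the example URL from its parameters using only the standard library. The helper names (build_query_url, fetch) and the User-Agent string are hypothetical placeholders; fetch is defined but not called here, so nothing is sent over the network.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Endpoint copied from the example above.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_query_url(titles, fmt="json", endpoint=API_ENDPOINT):
    """Build a URL asking for the latest revision content of the given page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": fmt,
        "titles": titles,
    }
    # quote_via=quote encodes spaces as %20, matching the example URL.
    return endpoint + "?" + urlencode(params, quote_via=quote)


def fetch(url, user_agent="example-research-script/0.1 (you@example.org)"):
    """Send the request and return the response body as text.

    The User-Agent value is a placeholder; replace it with one that
    identifies your own client.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


url = build_query_url("Main Page", fmt="xml")
print(url)
```

Building the query string from a dictionary avoids hand-escaping page titles that contain spaces or non-ASCII characters.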

Further (and more complex) examples can be found here.

Also see:

Existing tools

To try out the API interactively, use the API Sandbox.

Access

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers may be granted special access rights on a case-by-case basis.

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Support

FAQ: http://www.mediawiki.org/wiki/API:FAQ

Mailing list: mediawiki-api

Toolforge and PAWS

Toolforge hosts command-line and web-based tools, which can query copies of the databases. The copies are generally up to date in near real time, but replication lag sometimes occurs.

PAWS is a Jupyter Notebook environment within Toolforge that lets you, for example, query database replicas for analysis.

Home page

https://toolforge.org/

Description

Content

Toolforge hosts copies of the databases of all Wikimedia projects including Commons. You are allowed to use the contents of the databases as long as you don't violate the rules.

Data format

Learn more about the current database schema.

How to

Using Toolforge requires familiarity with the Unix/Linux command line, SSH keys, SQL/databases, and some programming.

To start using Toolforge, see this Quickstart guide.

Existing tools

See https://admin.toolforge.org/

Support

  • Via mailing list: cloud@lists.wikimedia.org, a list for announcements and discussion related to the Wikimedia Cloud VPS project. You can find the archives here: [5]

Projects using Toolforge / Toolserver data

(In 2014 Toolforge replaced the "Toolserver" server cluster managed by WMDE.)

On the old toolserver:

Recent changes stream

See wikitech:EventStreams to subscribe to recent changes on all Wikimedia wikis. The service broadcasts edits and other changes as they happen; confirmation that an edit has completed typically arrives faster through the stream than through the browser.
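As a rough illustration of consuming such a stream, the Python sketch below assumes that events arrive in the Server-Sent Events (SSE) text format, with each event carrying a JSON document in its "data:" field. The function name iter_sse_events, the sample lines, and the JSON field names (wiki, title, type) are assumptions made for this example; consult wikitech:EventStreams for the actual stream URLs and event schema.

```python
import json


def iter_sse_events(lines):
    """Yield the decoded JSON payload of each SSE event in an iterable of lines.

    An event ends at a blank line. Only "data:" fields are used; other
    fields (such as "event:" or "id:") and ":" comment lines are skipped.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE allows one optional space after the colon
            data_parts.append(value)
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []


# Hypothetical sample standing in for lines read from a live HTTP response.
SAMPLE_STREAM = [
    ": keep-alive comment\n",
    "event: message\n",
    'data: {"wiki": "enwiki", "title": "Main Page", "type": "edit"}\n',
    "\n",
    "event: message\n",
    'data: {"wiki": "dewiki", "title": "Hauptseite", "type": "edit"}\n',
    "\n",
]

events = list(iter_sse_events(SAMPLE_STREAM))
for event in events:
    print(event["wiki"], event["title"])
```

Because the function accepts any iterable of text lines, it can be fed from a decoded streaming HTTP response as well as from a saved sample.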

Old IRC recent changes feed

Wikimedia also has IRC feeds of recent changes hosted on the irc.wikimedia.org server. EventStreams is more robust and easier to parse, but the old system is still operational and its details follow.

  • Changes shown automatically as they happen.
  • Feeds for each wiki in a separate channel.
  • Filtered feeds available with cloak

Data and format

Each wiki edit is reflected in the wiki's IRC channel. Displayed URLs give the cumulative differences produced by the edit concerned and any subsequent edits. The time is not listed, but timestamping may be provided by your IRC client.

The format of each edit line is:

[page_title] [URL_of_the_revision] * [user] * [size_of_the_edit] [edit_summary]

You can see some examples below:

<rc-pmtpa> Talk:Duke of York's Picture House, Brighton http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 *Fortdj33* (-14) Updated classification

<rc-pmtpa> Bloody Sunday (1887) http://en.wikipedia.org/w/index.php?diff=542604908&oldid=542604828 *184.61.149.187* (-2371) /* Aftermath */
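A feed line in this format can be split into its fields with a regular expression. The Python sketch below is one possible approach, not an official parser. It assumes the line is plain text as displayed above (a raw IRC message may also contain color control codes that would need stripping first), and the names LINE_RE and parse_rc_line are hypothetical.

```python
import re

# Fields follow the layout described above:
# [page_title] [URL_of_the_revision] * [user] * [size_of_the_edit] [edit_summary]
LINE_RE = re.compile(
    r"^(?:<[^>]+>\s+)?"              # optional "<nick> " prefix shown by the IRC client
    r"(?P<title>.+?)\s+"             # page title (may contain spaces)
    r"(?P<url>https?://\S+)\s+"      # diff URL
    r"\*(?P<user>[^*]+)\*\s+"        # user name or IP address between asterisks
    r"\((?P<size>[+-]?\d+)\)"        # size of the edit in bytes, in parentheses
    r"\s*(?P<summary>.*)$"           # edit summary (may be empty)
)


def parse_rc_line(line):
    """Return the fields of one feed line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["size"] = int(fields["size"])
    return fields


SAMPLE = (
    "<rc-pmtpa> Talk:Duke of York's Picture House, Brighton "
    "http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 "
    "*Fortdj33* (-14) Updated classification"
)

parsed = parse_rc_line(SAMPLE)
print(parsed)
```

Returning None for lines that do not match lets a client skip other channel messages without raising an error.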

Location

IRC feeds are hosted on the irc.wikimedia.org server.

Every one of the >730 Wikimedia wikis has an IRC RC feed. The channel name is #lang.project. For example, the channel for German Wikibooks is #de.wikibooks.

Existing tools

  • wm-bot lets you get IRC feeds filtered according to your needs. You can define a list of pages and get notifications of revisions on those pages only.
  • WikiStream uses IRC feeds to illustrate the amount of activity happening on Wikimedia projects.
  • wikimon is a WebSocket-oriented monitor for the IRC feeds

Access

Anyone can access the IRC feeds. To receive feeds filtered to specific pages, however, you need wm-bot.

Analytics dumps

Home page

Pageview statistics

https://dumps.wikimedia.org/other/pageview_complete/readme.html

Content

Each request for a page reaches one of Wikimedia's Varnish caching hosts. The project name and the title of the requested page are logged and aggregated hourly. Higher-quality data, with spider traffic filtered out, has been available since May 2015. The deprecated pagecounts-raw dataset has English statistics since 2007 and non-English statistics since 2008, and pagecounts-complete stitches together the best available data for every period.

Files starting with "project" contain total hits per project per hour. A separate set with repaired counts is maintained as well (several cases of multi-month underreporting could be fixed from secondary sources).

Note: These are not unique hits, and renamed or moved titles are counted separately.

Download

https://dumps.wikimedia.org/other/pageview_complete/

Data format

See the README for up-to-date details on the format. For data since 2015, however, the format is:

[Project] [Article_name] [Page_id] [agent] [Daily total] [Hourly counts from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X]

Example:

ab.wikipedia Абиблиа 4651 desktop 2 K1R1

This line means that the Abkhazian Wikipedia article ab:Абиблиа was viewed twice, once in hour 10 (K1) and once in hour 17 (R1), in UTC.
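The letter-coded hourly counts can be decoded with a few lines of code. The Python sketch below follows the field layout given above for data since 2015. The function names (decode_hourly, parse_pageview_line) are hypothetical, and the sketch assumes that fields are separated by single spaces and that titles contain no spaces; check the README before relying on either assumption.

```python
import re
from string import ascii_uppercase


def decode_hourly(encoded):
    """Turn a string such as 'K1R1' into {hour: count}, where A = 0 ... X = 23."""
    return {
        ascii_uppercase.index(letter): int(count)
        for letter, count in re.findall(r"([A-X])(\d+)", encoded)
    }


def parse_pageview_line(line):
    """Split one line into its six fields and decode the hourly counts."""
    project, title, page_id, agent, daily_total, hourly = line.rstrip("\n").split(" ")
    return {
        "project": project,
        "title": title,
        "page_id": page_id,  # kept as text in case the value is not numeric
        "agent": agent,
        "daily_total": int(daily_total),
        "hourly": decode_hourly(hourly),
    }


record = parse_pageview_line("ab.wikipedia Абиблиа 4651 desktop 2 K1R1")
print(record["hourly"])  # {10: 1, 17: 1}
```

A simple consistency check is that the decoded hourly counts add up to the daily total, as they do in this example (1 + 1 = 2).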

Existing tools

You can interactively browse the page view statistics at: https://pageviews.toolforge.org

Support

More details on the Pageviews Analysis tool can be found in its documentation.

Research projects using data from this source

WikiStats

Home page

http://stats.wikimedia.org/

Also see: mw:Analytics/Wikistats

Description

Wikistats is an informal but widely recognized name for a set of reports, initially developed by Erik Zachte in 2003, which provide monthly trend information for all Wikimedia projects and wikis.

Content

Wikistats offers many dashboards that display trends in reading, contributing, and content, broken down by project, such as:

  • unique visitors
  • page views (overall and mobile only)
  • editor activity
  • article count

Data format

Data is presented as charts with the option to download the underlying data.

Support

For more details on Wikistats, see the documentation.

DBpedia

Home page

http://dbpedia.org

Description

DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.

Content

English version of the DBpedia knowledge base

  • describes 3.77 million things
  • 2.35 million are classified in a consistent Ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.)

Localized versions of DBpedia in 111 languages

  • together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia

The data set also features:

  • about 2 billion pieces of information (RDF triples)
  • labels and abstracts for >10 million unique things in up to 111 different languages
  • millions of
    • links to images
    • links to external web pages
    • data links into external RDF data sets
    • links to Wikipedia categories
    • YAGO categories

Data format

  • RDF/XML
  • Turtle
  • N-Triples
  • SPARQL endpoint

Download

http://wiki.dbpedia.org/Downloads38 has download links for all the data sets, different formats and languages.

http://dbpedia.org/sparql - DBpedia's SPARQL endpoint
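A query can be sent to the SPARQL endpoint as an ordinary HTTP GET request with the query text in a "query" parameter, as described by the SPARQL protocol. The Python sketch below only prepares such a request and does not send it. The helper name build_sparql_request and the sample query (the English label of the Berlin resource) are illustrative assumptions; requesting JSON results through the Accept header is likewise an assumption about what the endpoint will honor.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Endpoint taken from the line above.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: fetch the English label of one resource.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 1
"""


def build_sparql_request(query, endpoint=SPARQL_ENDPOINT):
    """Prepare (but do not send) a GET request carrying the SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
print(request.full_url)
```

To run the query, pass the prepared request to urllib.request.urlopen and decode the response body as JSON.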

How to and examples

  • Use cases shows the different ways you can use DBpedia data (such as improving Wikipedia search or adding Wikipedia content to your webpage)
  • Applications (broken link!) shows the various applications of DBpedia including faceted browsers, visualization, URI lookup, NLP and others.

Existing tools

  • DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.
  • RelFinder is a tool for interactive relationship discovery in RDF data

Access

DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.

Support

Mailing list: DBpedia Discuss

More:

Research projects using data from this source

DataHub

The Wikimedia organization on the Open Knowledge Foundation's DataHub is a collection of datasets about Wikipedia and other projects run by the Wikimedia Foundation.

The DataHub repository is meant to become the place where all Wikimedia-related data sources are documented. The collection is open to contributions and researchers are encouraged to donate relevant datasets.

The Wikimedia group on DataHub points to some additional data sources not listed on this page. Some examples are:

Wikivoyage also maintains data on its own DataHub:

  • Hotels/restaurants/attractions data as CSV/OSM/OBF
  • Tourism guide for offline use