Determining the relative quality of one Wikipedia project to another: One approach with English, Spanish, Catalan, Galician, Aragonese and Euskera Wikipedias

This is not as polished or as finished as I would have liked. Apologies. Life got in the way. I think the core findings and methodologies are still interesting and worth sharing despite the lack of polish. The raw data is here: Spanish female politicians, and may be useful in terms of understanding how the results differ when you include null values as zero, rather than leaving them out of the conclusion.

Recently, on the research list, there has been a discussion regarding understanding the relative quality of articles on one language Wikipedia project to another Wikipedia project.

 

Exactly how to go about doing that is a rather subjective task, as quality could be potentially defined as, well, subjective. I’m going to try to do this within a very limited context.

 

The reason for these limits is that the metrics for easily measuring quality largely depend on the specific field of inquiry. Quality sport articles will have different features than quality articles about plants, which will in turn have different features than quality articles about military battles.

 

At the same time, I am not a programmer.  I cannot easily do programming things that would allow me to do bulk analysis of articles.  I need a very small sample to be able to feasibly work with.

 

When dealing with different languages, there is also the issue of which sources get used. Often, people use English sources because those are easily available. When articles are translated, in many cases people appear to just reuse those sources or find a best local fit. To get a better idea of the actual quality, it appears to me that the subject matter used to assess quality should largely be outside the English speaking domain, for the purpose of best understanding source usage. Quality, thus, cannot become purely based on the quality of the translation and the local translators' willingness to use other sources.

 

After some thought, I have decided to use the articles for “Female MEPs for Spain.” (Women representing Spain who are or have been Members of the European Parliament.) This is a small and finite list, which means I can manually get a large amount of data for comparison purposes. All the articles are about the same topic, and because they are all about women, there are unlikely to be issues related to systemic bias in content creation. These articles are likely to exist in English, Spanish and possibly other languages of Spain. Most of the sources should be from Spain because the topic is Spain, so a better feel for local quality can be understood in the context of language.

 

For this analysis, the decision was made not to examine languages other than English and the languages used in Spain. This is because the sample for other languages is very small, and given the already small sample, it is not likely to have a lot of useful information. In English, there are 20 total articles.

 

In Spanish, there are 14 articles. In Catalan, there are 20 articles. In Galician, there are 7 articles. In Euskera, there are 3 articles. In Occidental, there are 0 articles. In Extremaduran, there are 0 articles. In Aragonese, there is 1 article. In terms of level of coverage, the best languages are English and Catalan because they have 20 total articles. The next best Wikipedia in terms of level of coverage is Spanish, followed by Galician, Euskera and Aragonese. Other languages in Spain are not represented.

 

This lends itself to a philosophical question: With the languages having uneven sample sizes, should the analysis for overall article quality be based on the actual articles and ignore the non-existent articles, or should it treat the non-existent articles as having values of zero?  In this case, I will use both measures to determine quality.  The emphasis will be placed more on the existing articles because it allows for actual comparisons between articles.

 

There are a lot of criteria for determining the quality of an article about a female Spanish politician.  By using many criteria, examining them together, you can begin to get a comprehensive idea as to the relative quality while realizing that each article’s quality may differ.

 

Generally, this analysis defines the quality of a Wikipedia article about a politician as having four components. The first is appearance, and the presence of things not necessarily connected to the article text. This arguably is the least important criterion. It includes having key external links, easy ways to get simple information without having to read the text, and having a picture. The second is the content of the article itself in terms of length and other general features related to overall perception of article quality that do not relate to the topic. This is the second least important criterion, because these features are in some ways independent of the actual textual content. The third criterion is sourcing. This matters a great deal as it defines the foundation of knowledge. The fourth criterion is comprehensiveness of the article as a “political biography,” meaning it has some of the features that define a good political biography. These criteria should be weighted to favor the more important criteria over the less important ones.

Article “Appearance” Criteria

As this criterion is the least important one, points were weighted to make it count less than the others, with a maximum total of 4.85 points available.

 

The first criterion I am going to use is: Does the article have a picture? I believe this is a criterion for quality because many people want to know what a politician looks like. Related to this, does the article use a high quality picture of the politician? If the article has 1 picture, it gets 1 point. If the article has a picture but only because it was derived/cropped from another picture, it gets half a point. If the article has 2 or more pictures, it gets 2 points.[1] Because none of the articles have images labeled as being high quality, there is no value in assigning further value to pictures.

 

The second appearance related criterion I am going to use is: External links, found either in an infobox or in the article, to the politician’s official page and to any official social media presence they may have. The reason for including this criterion is my personal belief in the importance of going to an officially sanctioned source as part of knowledge formation around a subject. Half a point will be given for a link to an official site, and half a point will be given for a link to an official social media presence. The most available points for an article will be 1.

 

The third appearance related criterion is: Presence of an infobox and a footer. Infoboxes provide a lot of quality information in an easy to consume manner. Related to this is the presence of a footer that provides related conceptual topics, such as the person who preceded or succeeded the woman in her position or other members of the same political party. The presence of an infobox has been assigned a value of 0.6 and the presence of a footer has been assigned a value of 0.25.

 

The final criterion is the presence of a warning box on the article that says there is a problem with the article that almost certainly relates to content.   It is a potentially strong visual cue to readers that the article is not of high quality. All articles have 1 point. If there is a warning on the page itself, the article loses a point.

 

Using only appearance criteria, the maximum number of points an article could have is 4.85. There are only two articles which have full points, and both are found on Spanish Wikipedia: Pilar del Castillo and Rosa Díez. The lowest theoretically possible total is zero. No articles have zero. The lowest quality article in terms of total points is Inés Ayala on Spanish Wikipedia.
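The appearance rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the spreadsheet I actually used. The function and parameter names are made up for the example, and it assumes the half point for a cropped picture only applies when the article has a single picture.

```python
def appearance_score(pictures, picture_is_crop=False, official_site=False,
                     official_social=False, infobox=False, footer=False,
                     warning_box=False):
    """Appearance points for one article (hypothetical helper); maximum 4.85."""
    # Pictures: 2 or more = 2 points, 1 = 1 point, a lone cropped picture = 0.5.
    if pictures >= 2:
        picture_points = 2.0
    elif pictures == 1:
        # Assumption: the crop rule only matters when there is one picture.
        picture_points = 0.5 if picture_is_crop else 1.0
    else:
        picture_points = 0.0

    # Official links: half a point each, capped at 1 by construction.
    link_points = 0.5 * official_site + 0.5 * official_social

    # Infobox is worth 0.6, footer is worth 0.25.
    layout_points = 0.6 * infobox + 0.25 * footer

    # Every article starts with 1 point and loses it if a warning box is present.
    warning_points = 0.0 if warning_box else 1.0

    return round(picture_points + link_points + layout_points + warning_points, 2)


# An article with everything and no warning box reaches the 4.85 ceiling.
best = appearance_score(2, official_site=True, official_social=True,
                        infobox=True, footer=True)
```

An article with no picture, no links, no infobox, no footer and a warning box comes out at the theoretical floor of zero.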

 

Article “Text Quality” Criteria

The first text quality issue is article sections. The presence of article sections suggests the article has organization and real structure. Each article gets 1 point for each unique article text related section. Headers for external links, see alsos, and references are not counted.

 

The second criterion used for text quality was article length in words. The method for determining this was to count the words in the text of the article, minus references, external links, infobox text, footer text, table text, image descriptions and lists. The articles were then sorted based on length. The longest 20% of articles were given 4 points. The next longest were given 3 points. The middle fifth of articles were given 2 points. The next shortest were given 1 point. The shortest 5% were given 0 points. This was done to give articles comparatively more value if they were among the longest, and less value if they were shorter, without passing any relative judgment on the quality of the text itself.

 

The third criterion again uses article size.  This time it divides the number of words by 250 to derive a number.  The number 250 was largely to offset the large values given to the outliers by bringing the number down to be more in line with relative weighting used with other criteria.   This gives value to the actual length of the article, as opposed to relative length.
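For anyone who wants to reproduce the three text quality numbers, here is a rough Python sketch. The names are invented for the example. It assumes the five length bands are equal fifths of the sorted list (the description above gives 5% for the shortest band, so treat the band edges as an assumption), and it breaks ties by original list order, which is an arbitrary choice.

```python
def length_band_points(word_counts):
    """Relative length points (4 down to 0) for each article, by rank.

    Assumption: five equal bands; ties keep their original order.
    """
    n = len(word_counts)
    # Article indices sorted from longest to shortest.
    order = sorted(range(n), key=lambda i: word_counts[i], reverse=True)
    points = [0] * n
    for rank, idx in enumerate(order):
        band = rank * 5 // n      # 0 = longest fifth ... 4 = shortest fifth
        points[idx] = 4 - band
    return points


def text_quality_score(sections, words, band_points):
    """Sections (1 point each) + relative length band + words / 250."""
    return sections + band_points + words / 250


# Five made-up word counts, only to show the banding.
bands = length_band_points([100, 1000, 50, 500, 250])
example = text_quality_score(sections=3, words=500, band_points=3)
```

Because sections and words / 250 have no upper limit, neither does the total.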

 

Using readability as a criterion was considered. For English, Flesch-Kincaid was used. For Spanish, the Fernandez-Huerta Scale was used. The problem was that trying to use one or both for languages they were not intended for led to results that seem implausible. Both scales had problems with Euskera, claiming the articles were written at an extremely high level. The sole Aragonese article had similar problems. While Catalan and Galician appear to be somewhat in line with Spanish articles, the inability to use two of the three languages and the requirement to use another system for English means readability is not feasible as a criterion.

 

Because Wikipedia articles generally do not have a predictable maximum ceiling for article word length or number of sections, there technically is no ceiling for the maximum number of points available in this category.

 

Article “Sourcing” Criteria

The first sourcing related criterion is total number of sources.  An article gets 0.3 points for each source found in the reference section of the article.   The number 0.3 was largely to offset the large values given to the outliers by bringing the number down to be more in line with relative weighting used with other criteria.

 

The second sourcing criterion is the language of the sources. Linguistic diversity amongst Spain’s languages should assist in offsetting potential POV problems and assist in providing the best coverage for politicians from areas where Spanish is not the sole regional language. The use of other language sources also potentially provides a more global perspective on the politician’s influence. For every different language among an article’s sources, other than the language of the project, the article is awarded half a point. (In some cases, the original source may be broken. In this case, the website language will be used and value given based on that.) Few points are being awarded because of a desire not to provide too many additional points to articles just by virtue of the article having sources.

 

The third sourcing criterion is diversity of sources.  Ideally, the article should draw from different types of sources in order to provide a comprehensive, factual and neutral presentation of the person’s political life.  This is in line both with Wikipedia’s 5 pillars and with the requirements of a good political biography.  The different types of sources include newspapers  (television, radio, magazines), books, academic and trade journals, academic and education websites, social media, government (and parliament) sites, conference/commercial/social organization (not political or governmental) sites, party/political websites, and official sites.  These were all weighted as one point for having a reference in these categories.
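Put together, the three sourcing criteria look something like the following Python sketch. The function name, the language codes and the source type labels are placeholders I made up for the example. Each source is reduced to a (language, type) pair.

```python
def sourcing_score(sources, project_language):
    """Sourcing points for one article (hypothetical helper).

    `sources` is a list of (language, source_type) pairs, one per reference.
    """
    # 0.3 points for every source in the reference section.
    count_points = 0.3 * len(sources)

    # Half a point for each distinct language other than the project's own.
    other_languages = {lang for lang, _ in sources if lang != project_language}
    language_points = 0.5 * len(other_languages)

    # One point for each distinct type of source (newspaper, book, ...).
    source_types = {kind for _, kind in sources}
    type_points = 1.0 * len(source_types)

    return round(count_points + language_points + type_points, 2)


# A made-up Catalan Wikipedia article with three references.
refs = [("es", "newspaper"), ("es", "government"), ("en", "newspaper")]
score_on_catalan = sourcing_score(refs, project_language="ca")
```

The same three references on Spanish Wikipedia would score half a point less, because the Spanish sources no longer count as another language.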

 

Article “Political Biography” Criteria

The criteria for a good political biography tend to involve, broadly, getting a better idea of how politics and government work while reading about the subject of the biography. Some of the criteria for good political biographies may be slightly problematic in a pure Wikipedia sense, in that the source material just may not be available to address the points adequately. Still, in at least some cases, there should be adequate material about two or three of the politicians involved that are represented in most languages to begin to get an adequate picture and allow these outliers to pull up the average for the remaining article subjects. Because of the importance of these criteria relative to all other criteria, each has 12 available points, and if the biography partially meets the requirement, some points may be given. This provides a maximum total of 72 points, which accounts for roughly half the points the maximum article has available.

 

The assessment of these criteria is purely subjective. To a certain extent, the criteria also universally require greater depth so as to contextualize events that take place in the life of a politician. The shorter the article and the less information there is about a specific topic, the fewer points will be subjectively given. For instance, if the thought process is only provided for one particular incident and the explanation is short, then one point will be given. If the explanation is longer or there are two incidents where short thought processes are explained, then two points may be given.

 

The first criterion is, does the article present information about how the person governed. This includes basic information about what the person did while in power. To get at least one point, there needed to be at least one or two facts about what legislation the politician was involved with or voted for. Holding the office alone was not counted in this case as giving any sense of how the person governed.

 

The second criterion is, does the article present information on the thoughts of the leader in terms of how they governed. The article needs to explain some of the politician’s thought process behind political decision making. The article cannot present events absent any context as to the politician’s reasons for their actions.

 

The third criterion is, does the article provide insight into how the person impacted political structures and policies in Spain, their specific region or internationally. Context needs to be provided as to the impact of these policies so the reader understands the short and long term consequences of the politician’s actions. To a certain extent, if the article mentioned how the politician performed relative to their party during an election, at least one point was awarded. One point may also have been awarded had some background information been provided about how their political party works in a national, non-office holding context.

 

The fourth criterion is, does the article show how being in power impacted the individual politician. This can be biological or personal. The person had a heart attack, their hair went gray, etc. Their involvement in politics ruined their relationships, or put them in a position where they met a future spouse, or kept them in the closet. The individual went to jail, or was continually followed by journalists who allowed them no privacy. In cases of the politician going to prison for corruption or being found guilty of corruption, zero points were awarded here unless details were provided on how this impacted them personally.

 

The fifth criterion is that the biography does not separate the person into having a purely private life and a purely public life. The two should be explained as they relate to each other, especially as the person’s primary notability will be for being a politician. Details about a politician’s life should not be present just for the sake of having them there, but be contextualized against their political life. At university, did they display an interest in politics? Did a labor dispute put them into a place where they became politically active inside a job? What events led them to becoming a politician? How did their previous life experiences prepare them for being a politician? To a certain extent, the article having a paragraph with facts about their education and other details about their life earned one point. Only when there were more of those details, and they connected more directly in the text to their political activities, did a biography earn more than 1 point.

 

The sixth criterion is the relevance of the biography to Spanish and other Europeans who may have been impacted by the political events the politician has been involved in.  Readers need to be able to understand the politician’s impact on their own lives.
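The political biography total is just six subjective scores of 0 to 12 added together. A small Python sketch, with criterion labels that are my own shorthand for the six criteria above and not official names:

```python
# Shorthand labels (made up for this example) for the six criteria above.
CRITERIA = (
    "how_they_governed",
    "thought_process",
    "political_impact",
    "impact_on_the_person",
    "private_and_public_life_linked",
    "relevance_to_readers",
)
MAX_PER_CRITERION = 12


def political_biography_score(scores):
    """Sum the six subjective scores after checking each is within 0..12."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if not 0 <= scores[name] <= MAX_PER_CRITERION:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_CRITERION}")
    return sum(scores[name] for name in CRITERIA)


# The ceiling: full marks on every criterion.
ceiling = political_biography_score({name: 12 for name in CRITERIA})
```

The check on the range is there only to catch data entry mistakes. The hard part, deciding how many points a given article deserves, stays a human judgment.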

 

With 72 available points for each article, the most points earned by any article was 16. It was the Spanish language article about Rosa Díez. On the other side, 13 articles were assessed as having 0 points. Over half of these articles were English, accounting for 8 of the 13 total articles. Catalan had 3 articles assessed as 0 points. Spanish and Euskera each had 1 article. Galician and Aragonese had no articles assessed as 0 points.

 

 

Findings
Overall, the article quality across all languages was relatively poor. Not a single article would be objectively defined as a good political biography. In terms of Wikipedia, none of the sample articles met local Wikipedia standards for being a good article. Most were extremely short, averaging 288 words across all languages. Most were poorly sourced. While there was an average of 2.3 sources per article, the median and mode of zero give a better idea as to the actual volume of the sourcing. Most articles lacked pictures, or had a picture that was cropped from another picture and of poor overall quality. Most articles did not give the reader a clear idea of the policies the politician supported, nor of the impact that legislation a politician supported had on the lives of the electorate. Almost all articles failed to explain the wider political impact, or lack of impact, the politician had on Spain. The articles, across all languages, were not very useful.

 

Overall when measured against the assessed criteria, Spanish Wikipedia had the highest quality of articles.   With the highest single article point total of 60, articles on Spanish Wikipedia averaged 18.75 points.  This is significantly higher than the next highest assessed language project, Galician Wikipedia which had an average point total of 11.27.

 

Rank   Language    Score
1      Spanish     18.76
2      Galician    11.28
3      Catalan      8.48
4      Euskera      8.38
5      English      7.81
6      Aragonese    4.00

 

Rounding things out, Catalan Wikipedia was third, Euskera was fourth, English was fifth and Aragonese was last.  Even when the absence of articles is factored in with null values for these articles, Spanish Wikipedia still ranks as having the best article quality. Catalan finishes second, English Wikipedia third, and Galician fourth.  The absence of 13 of the 20 available articles hurts Galician Wikipedia a lot.
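The difference between the two ways of averaging is easy to show in Python using only the numbers in this post: the average score per existing article from the table, the number of articles each project has, and the 20 women in the sample. The variable and function names are made up for the example. Treating a missing article as zero is the same as multiplying the existing average by the share of the 20 articles that exist.

```python
TOTAL_SUBJECTS = 20  # women in the sample; English and Catalan cover all 20

# project: (average score over existing articles, number of existing articles)
projects = {
    "Spanish": (18.76, 14),
    "Galician": (11.28, 7),
    "Catalan": (8.48, 20),
    "Euskera": (8.38, 3),
    "English": (7.81, 20),
    "Aragonese": (4.00, 1),
}


def zero_filled_average(mean_existing, n_existing, total=TOTAL_SUBJECTS):
    """Average over all subjects, counting each missing article as 0 points."""
    return mean_existing * n_existing / total


zero_filled = {
    name: zero_filled_average(mean, count)
    for name, (mean, count) in projects.items()
}

# Projects ordered from best to worst once missing articles count as zero.
ranking = sorted(zero_filled, key=zero_filled.get, reverse=True)
print(ranking)
```

Run this way, the order comes out as Spanish, Catalan, English, Galician, Euskera, Aragonese, which is the reordering described above.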

 

Why is the quality of Spanish Wikipedia so relatively high?  Why is Galician the second best language project?   Spanish ranked first in 8 of the 11 criteria.  Galician Wikipedia finished in the top 1 or 2 for five of the criteria.  In some of these categories, they had almost a full point above the lower performing language projects.  This was particularly important in the category of political biography, where Spanish Wikipedia articles averaged 4.4 points and Galician Wikipedia averaged 3.2 points per article.  In contrast, English Wikipedia averaged 1.5 points and Catalan Wikipedia averaged 1.35 points. Galician Wikipedia also picked up almost a full point on both English and Catalan Wikipedia when it came to average number of sections per article, getting 1.7 points to English Wikipedia’s 1.05 points and Catalan Wikipedia’s 0.9 points.  Galician Wikipedia also picked up half a point on both Catalan and English Wikipedia when it came to sourcing, averaging 0.7 points per article, which was still measurably less than Spanish Wikipedia which averaged 2.2 points per article. As each source was worth 0.3 points, this gave Galician Wikipedia more opportunities to get points for quality when it came to language diversity and source diversity.[2]

 

The political quality of the article correlates strongly with the article length, total sources in the article, and the number of sections an article has.  It would be more surprising if these qualities did not correlate well to each other, because articles need length, sources and organization as part of being able to successfully meet the political biography quality criteria.

 

Footnotes

[1] Originally, the intention was to give articles that used a recognised quality picture 1 point. In this case, high quality would have been defined as the picture being recognised, either on Commons or on a local Wikipedia project where the image is being used, as a good picture or a quality picture. Unfortunately, none of the pictures used in the articles met these criteria and so this was not used.

[2] In reality, that did not happen because only one article on Galician Wikipedia had any sources, and it had 17 of them.  The number and diversity of sources was limited in the article, and consequently, both English and Catalan Wikipedia outperformed Galician Wikipedia on source language and source type diversity.

 

Belgian men’s goalball team departs for Finland for World Championships

The story on English Wikinews that I got published yesterday was Belgian men’s goalball team departs for Finland for World Championships. Like other stories, this one was inspired by things I saw on Facebook. There isn’t much being written about the world championships in the English language media, and trying to find some way to do that in a way that is newsworthy? PITA. I went with the easiest solution to discuss an event that is supposed to have started yesterday but where the actual games do not start until June 30. I found a team competing that posted something, anything, about going. In this case, it was a selfie that basically said, “We’re at the airport! Hi! We’re off!” and ran with that.

 

Because there was so little information available elsewhere, and because nothing had been written recently for most other teams, there is pretty much no information about other teams. :/ Sad for anyone else wanting to write about the competition before it starts who doesn’t have an in with a team through a local connection. We’ll see.

Japanese wheelchair basketball player Mari Amimoto leads in scoring at world championships

Yesterday’s Wikinews challenge was to take basically a one source piece of information I wanted to write about and make it into an actual more detailed article.  This was highly problematic, because well, two English Wikinews reviewers basically said only reporting from the official table of the leading scorers for a tournament is a little problem. (un problema pocita)

At the end of the day, Japanese wheelchair basketball player Mari Amimoto leads in scoring at world championships was published. It is a nice little story about the Women’s Wheelchair World Championships currently being played in Toronto, Canada. The reviewers did a good job at dealing with the small little problems. In any case, the article is from a perspective I don’t think that the other news outlets would take. (Though to be fair, I wouldn’t put it past the Paralympic Press people. They can often be really good at doing those sort of stories, precisely because they are often writing for an international audience as opposed to a purely domestic one.)

That issue of trying to do a new take on something can be a big challenge when trying to write from limited, non-news sources.  Very hard to do.  Beyond that, as a journalist writing for Wikinews, I want to name drop.  As many athletes as I can mention, I like to do because I think the little bit of attention can be very good.

Because I’ve decided to try to write more about Wikinews, and because I want to go to the Rio Games, I feel like I need to start preparing now by more consistently writing about Paralympic sport.    Lost that thought.  Ah yeah.  I’ve decided to follow a number more accounts on Facebook to see if I can keep up with the “latest” news so I can write about it more.

If you have a Paralympic story idea that I can write about for Wikinews, please get in touch.  I would be pretty much open to anything.

Russians top podium on second day of European Deaf Swimming Championships

Russians top podium on second day of European Deaf Swimming Championships is an article I got published yesterday on English Wikinews.  Also, Spanish Wikinews.  It was one of those exercises in disaster.  I originally wrote the article based on preliminary results.  In between submitting the article for review and the article being reviewed, the preliminary results changed to final results.  This completely screwed the article text, because it made it all inaccurate.  Erk?

The event is a world championship for deaf sport, which is not aligned with International Paralympic Committee in the sense that deaf sport just isn’t.  (The politics of this is actually quite interesting, in why deaf sports didn’t join the Paralympic movement.  Also, there is apparently a fear in some places that deaf sport will completely disappear because the technology is much better, and hearing problems are becoming much more fixable. )  What does this mean in terms of writing Wikinews articles from faaaaaaar away in Spain for a competition in Russia?  It means finding secondary sources to verify facts is PITA and not actually very doable.  The results seem pretty newsworthy to me, but verification.  Verification.  Verification.  Erk. Erk. Erk.

Errors were all eventually addressed, and the article got fixed on English Wikinews and then published.  😦

New article published on Wikinews

I haven’t been very active on Wikinews lately. At the moment, I am waiting to hear back from the TWG aff-comm liaison regarding getting two Wikimedia mailing lists set up. The purpose is basically to establish a multilingual Wikimedia version of “Help A Reporter Out.” With Wikimedia, I feel there are constant waiting games. If not one person, then another. Universities often seem like they operate faster than Wikimedia does. 😦 It’s morale killing, especially as these problems exist at all levels.

Anyway, those issues aside… I got an article published yesterday on English Wikinews. It is pretty crappy topic wise, and extensively covered by local Chicago media but hey. I wrote something: Spelling error appears on Medill School of Journalism diplomas.

 

Rape and the Sports Story: Some Observations for Sports Writers

The Sport Spectacle

Jessica Luther’s “Changing the Narrative” offers sports writers valuable guidance regarding the responsible reporting of stories about rape. It is full of good advice that should help sports writers to at least not make things worse. Luther’s excellent article inspired me to write out a few thoughts for the sports critic who wants to take their writing to a related, but slightly different place.

Given how often sports writers have to cover this kind of story, they might be wondering why “sports culture” has become synonymous with “rape culture.” Given the ubiquity of rape stories in sports news, perhaps it is time we entertained the possibility that sexual violence is not at the margin of sports culture, but is, in fact, at its center.

Some axiomatic observations:

Where there is segregation there is violence: Apartheid structures are enforced through violence. Radical segregation is enforced by terrorizing people. We need only look to the US’s history of lynchings…


Thank you Cindy

I’m emotionally crushed.  For the second time this week, I had to write an obituary about a female Wikimedian involved in addressing the movement’s gender gap.  Wikimedian Cindy Ashley-Nelson died at the Wikimedia Conference in Berlin early yesterday morning.  Her death follows that of Wikimedian activist Adrianne Wadewitz who died earlier in the week after a rock climbing accident on March 29.

Both women were inspiring in terms of their leadership, their contributions to Wikipedia while being active behind the scenes in movement governance, and their dedication.  Like myself, both believed that contributing to Wikimedia projects could change the world, and that knowledge is power.  Their individual contributions embody that.

While I did not have personal relationships with either, they served as role models in the community and brought attention to issues in the community as insiders that would not have otherwise been possible.  They participated in an environment that can at times be incredibly hostile towards women while being very successful.  I cannot easily see how the holes they left will be filled. 😦

I am thankful that in their lives, they spent time contributing.  I hope they can continue to live on forever in the collective community memory.

Cindy, in memoriam

Vale Cindy. Your efforts in the movement will not be forgotten.

Walk the Talk


Cynthia Ashley-Nelson died yesterday. She was attending the Wikimedia Conference as an AffCom member, and on Thursday had participated in her first annual AffCom meeting. The news about her death has surprised and shocked the people at the conference. I realise there are many people who might not be familiar with her, so I wanted to write a few words about her and the impact she made on those who knew her.

In my role as Board liaison to the Affiliations Committee, I had seen Cindy, as her friends called her, apply to become a member, and how she was finally elected to the committee. She had such a solid background, so relevant to the work AffCom does, she was such a strong candidate, it was a no brainer for AffCom to elect her. They were not disappointed. Cindy was participative, incredibly engaged from the first day on the work of…


“He had two knives” or when is a fact a fact?

When is a knife a knife?

“He had two knives” and “Police said he had two knives” are two separate facts. One of the problems some new journalists on English Wikinews have is recognizing what is a fact, and the above is a classic example. It is really easy to write opinions or assertions as facts without intending to.

A fair amount of Wikinews writing is synthesis writing. We may not be able to verify the facts ourselves by talking to sources, witnessing an event ourselves, or reading the original source material. It is really important to understand the facts that the sources we have present to us. If a source says, “Police said he had two knives,” the source is not asserting as fact that he had two knives. The fact is that the police made the claim. This is very different from a source saying “He had two knives,” where “he had two knives” is the fact.

This might seem like a minor quibble, but it has the potential to be hugely important.  Picture a court case.  All the evidence is clear: “He had two knives” which he used to make a peanut butter and jelly sandwich.  Lots of people saw him with two knives.  There were two dirty knives in the sink which had his fingerprints on them.  He admitted to having two knives, and using them to make sandwiches. There was a picture of him using two knives to make a sandwich.  His name was engraved on the  knives.  We know those knives are his.  That is a fact.

In other situations, this may not be a fact.  There was no picture of him with two knives.  They did not have his fingerprints on them.  The knives were not found in his house.  He denied that the knives belonged to him and that he used them to make a sandwich.  On the other hand, the police claim that he had two knives.  In this case, maybe the knives do belong to him.  (He could have wiped down his fingerprints, taken the knives out of the house, lied about not owning the knives and not making a sandwich.)  What we do know as a fact is the police made this claim.

These finer points do matter, and they impact how people understand the news they read.

Vale Adrianne Wadewitz

The Wikimedia movement is both very large, and very small. I never had the pleasure of meeting Adrianne Wadewitz but I was very aware of her work and her role in promoting the inclusion of women as participants and topics of articles on Wikipedia. She was very effective at drawing attention to a problem that needs attention, at a site where people sometimes form their core base of knowledge about a topic. Thus, it was sad news to learn of her passing this morning. 😦 She did great work and appeared to do so without alienating a lot of people, something that can be very difficult to do in the Wikimedia community. Her contributions to making Wikipedia, and the world, a better place will be missed.