Tux the Penguin reading books

FOSS Academic

More Mastodon Scraping Without Consent (Notes on Nobre et al 2022)

There’s a new paper out about Mastodon! But unfortunately, it’s a deeply problematic one.

Nobre et al’s “More of the Same? A Study of Images Shared on Mastodon’s Federated Timeline” is a paper that is now published in proceedings from International Conference on Social Informatics. (Unfortunately, it’s not open access.)

Because I’m currently researching the fediverse and blogging about that process, I thought I’d write up notes on this paper.

Why this paper? Frankly, because I’m pretty certain it violates the community norms, as well as terms of service, of many Mastodon instances. It instantly reminded me of the controversial paper from Zignani et al, “Mastodon Content Warnings: Inappropriate Contents on a Microblogging Platform”, which resulted in a scathing open letter and the retraction of a dataset from the Harvard Dataverse.

Nobre et al’s “More of the Same” is a study of image-sharing. The authors claim that it is about image-sharing on Mastodon, but really their focus is on images they culled from Mastodon.social’s federated timeline. They pulled 4M posts from 103K active users, of which 1M had images. Since they pulled posts from Mastodon.social’s federated timeline, they saw posts from 4K separate instances. The authors state that a “relevant number” of the images they found are “explicit.” They categorize the images as such after running them through Google’s Vision AI Safe Search system. They also run the images they find through Google’s image search to trace where the images came from and how they are shared on Mastodon. Ultimately, the authors don’t really make an argument, other than stating in passing that Mastodon needs better moderation, since people share explicit images.

In some ways, “More of the Same” lives up to its title: it’s more of the same poor scholarship that can be seen in Zignani et al (in fact, Nobre et al cite that controversial paper). Here are my critiques:

Key Problems in “More of the Same”

Violations of Community Standards and Privacy Policies

This is the big one. There has been a longstanding community prohibition against taking Mastodon posts out of context. This is something I have observed again and again as I do my participant observation in the fediverse. While many Mastodon users might be ok with having their posts taken, many are not. And many instances prohibit the practice in their terms of service. I myself am very careful about getting explicit consent from anyone I want to quote.

Did the authors examine the privacy policies or terms of service for all – or even any – of the 4K instances they culled content from? There’s no indication that they did. Indeed, unlike Zignani et al, these authors don’t even discuss privacy. The assumption seems to be that these posts are “public” and are thus fair game. But as the Association of Internet Researcher’s ethics guidelines have noted over the years, sometimes “public” doesn’t mean “public” – that is, it doesn’t mean “I want everyone in the world to see this.”

And most egregious is the fact that there is no human subjects/institutional review board information in the paper – it doesn’t even appear that the authors took that basic step.

Who Consented to be Analyzed by Google?

If there’s no indication that people consented to having their posts scraped, there’s definitely no indication that they consented to have their posts submitted to Google. But the authors do exactly that – twice. They ran images through Google’s Vision AI and also through Google’s image search.

How many of the 4K instances seek to avoid being indexed by Google? I’m actually not sure, but I have seen people discuss using robots.txt and other means to avoid having “public” posts be indexed. Regardless, exporting content from Mastodon into Google’s indexes violates community norms, if not terms of service for many instances.

What is “explicit?” And in what context?

The authors run images through Google safe search and find that a “relevant number” of them “may be explicit” – that they might have nudity, or violence, or are spoof images. The assumption here is that Google is the arbitor of explicitness online. Maybe Google is, maybe it isn’t, but one key aspect is missing: context. Are they porn images? Are they medical images? What context are they embedded in? Much like the “public” aspect discussed above, I have observed over and over again Fediverse users saying that they don’t want their posts taken out of context.

In addition, the authors note a number of the images are “spoof” images, per the Google analysis, and that spoof images are explicit. How so? Indeed, what does “spoof” even mean?

And, none of this accounts for the fact that Mastodon (and other fediverse sites) use content warnings and other devices to allow users a bit of control over what they see.

As I mentioned above, the observation that there are explicit images led the authors to make a rather weak, in-passing argument “moderation is necessary in the Mastodon environment as images, while being a popular media format, indeed display a consistent level of explicit content” (page 189). But again, without context, these authors cannot really judge whether or not the “Mastodon environment” is moderated or not.

A top-ten image sharing Mastodon instance is actually a Pleroma instance

The authors don’t publish a list of all the 4K instances they scraped content from, but they do post the Top Ten image-sharing instances. Given that they scrape Mastodon.social’s federated timeline, it’s no surprise that that instance is #1. There are also other, unsurprising inclusions on the top ten: botsin.space (#2), mastodon.online (#6).

There’s a surprising #3, though: milker.cafe. The authors talk about milker.cafe at length, because they note that milker.cafe’s 11 active users contributed a whopping 26K images to the dataset.

The problem is milker.cafe is a Pleroma instance, not a Mastodon one. Since they make claims about Mastodon not being moderated enough, this is a bit of a problem. I hadn’t heard of milker.cafe, so I took a look and got – surprise, surprise – eyeballs full of busty waifus… once I clicked through the “Adults Only” warnings. How many of Nobre et al’s “explicit” instances came from milker.cafe?

And why didn’t the authors flag this Top Ten image sharing site as a Pleroma instance, not a Mastodon one? Could they not tell the difference?

The Denominator Problem

Finally, the authors note that the largest instances share the most “explicit” images. This raises the denominator problem (as Eric Jardine might put it): they don’t offer any denominator, so we have no idea what percentage we’re dealing with.

As a whole, the paper, in my opinion, violates Mastodon community norms without really contributing much of anything to science. The authors’ key findings are a) people share images on Mastodon; that b) many of those images come from outside of Mastodon (e.g., Twitter, Reddit); and c) some of those images are – in the judgement of Google’s Safe Search algorithm – explicit. Therefore, they argue, Mastodon needs moderation. As I see it, the authors are relying on an age-old trick to get published: gin up a moral panic about some new digital media practice.

I don’t think that these simplistic findings required such a breach of community norms and privacy policies.

What to do?

The question is: beyond my critiquing this paper, what do we do? I mentioned this paper on scholar.social and people have floated the idea of another open letter, similar to the one sent about Zignani et al.

But would it be effective? I took a look at the previous open letter and note that many of the remedies requested haven’t been achieved. While Harvard’s Dataverse did pull the Zignani et al dataset, the paper itself remains published and unretracted. Unlike the case of Zignani et al, Nobre et al appear to have not posted their dataset anywhere.

Unlike Zignani et al, however, Nobre et al don’t mention privacy considerations. It could be that an open letter to their institutions (Universidade Federal de Minas Gerais and Universidade Federal de Ouro Preto) might result in these institutions’ ethics boards to take a look at the work of Nobre et al. And since Nobre et al discuss future research plans involving media scraping from Mastodon, an open letter might prompt these researchers to do better.

ADDENDUM

A fellow Mastodon member noted that there is an option in Mastodon for users to avoid being indexed by search engines. In fact, I have it set on my account. So, for anyone whose posts were included in the Nobre et al study, they have been indexed by Google without their consent.

Post Tags

Comments

I'm afraid my fediverse-to-blog comment system hasn't been working for some time. I'm going to fix it... someday. In the meantime, I'll discuss these blog posts on Mastodon (I'm @robertwgehl@scholar.social) and you can also reach me at robertwgehl @ protonmail DOT com.

Reply