bekkalokk/gitea: add robots.txt #97

Open
oysteikt wants to merge 1 commits from gitea-robots-txt into main
Owner
Ref https://git.pvv.ntnu.no/Drift/issues/issues/181
oysteikt force-pushed gitea-robots-txt from a040ef59a8 to c5c743d9af 2025-12-29 16:16:07 +01:00 Compare
oysteikt force-pushed gitea-robots-txt from c5c743d9af to f502a8ce4f 2025-12-29 16:34:31 +01:00 Compare
Owner

Why reintroduce bob?

Author
Owner

likely a bad rebase

oysteikt force-pushed gitea-robots-txt from f502a8ce4f to e9c82b4625 2026-01-29 05:09:46 +01:00 Compare
oysteikt added 1 commit 2026-05-25 05:19:02 +02:00
kommode/gitea: add robots.txt
Eval nix flake / evals (push) Successful in 4m1s
Eval nix flake / evals (pull_request) Successful in 4m0s
bc7d598fed
oysteikt force-pushed gitea-robots-txt from e9c82b4625 to bc7d598fed 2026-05-25 05:19:02 +02:00 Compare
Owner

I haven't tested or gone through the robots-txt module (💀 etc.), but the content/lists look fine.

Author
Owner

Do you remember the exact format you wanted for the output? I remember the main complaint was that we were using fancy syntax that all bots might not be able to parse. Instead of the following, how would you like it to look?

# Gitea internals
#
# See these for more information:
# - https://gitea.com/robots.txt
# - https://codeberg.org/robots.txt
#
User-agent: *
Disallow: /api/*
Disallow: /avatars
Disallow: /*/*/src/commit/*
Disallow: /*/*/commit/*
Disallow: /*/*/*/refs/*
Disallow: /*/*/*/star
Disallow: /*/*/*/watch
Disallow: /*/*/labels
Disallow: /*/*/activity/*
Disallow: /vendor/*
Disallow: /swagger.*.json
Disallow: /repo/create
Disallow: /repo/migrate
Disallow: /org/create
Disallow: /*/*/fork
Disallow: /*/*/watchers
Disallow: /*/*/stargazers
Disallow: /*/*/forks
Disallow: */.git/
Disallow: /*.git
Disallow: /*.atom
Disallow: /*.rss

# Language Spam
Disallow: /*?lang=

# AI bots
#
# Sourced from:
# - https://www.vg.no/robots.txt
# - https://codeberg.org/robots.txt
#
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: Amazonbot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: Crawlspace
User-agent: Diffbot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: GPTBot
User-agent: Google-Extended
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: OAI-SearchBot
User-agent: Omgili
User-agent: Omgilibot
User-agent: PanguBot
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: SemrushBot-OCOB
User-agent: Sidetrade indexer bot
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: YouBot
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: facebookexternalhit
User-agent: iaskspider/2.0
User-agent: img2dataset
User-agent: meta-externalagent
User-agent: omgili
User-agent: omgilibot
Disallow: /

Crawl-delay: 2

Sitemap: https://git.pvv.ntnu.no/sitemap.xml
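To make the "fancy syntax" concern concrete, here is a minimal Python sketch, assuming the concern is about the `*` wildcards in the `Disallow` paths. It compares a matcher that expands `*` (and a trailing `$`) with one that treats every rule as a literal prefix. The function names and the example paths are hypothetical and only for illustration; real crawlers may match differently.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    end_anchored = pattern.endswith("$")
    if end_anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + ("$" if end_anchored else ""))


def blocked_with_wildcards(path: str, patterns: list[str]) -> bool:
    """Matching as done by a crawler that understands '*' and '$'."""
    return any(robots_pattern_to_regex(p).match(path) for p in patterns)


def blocked_prefix_only(path: str, patterns: list[str]) -> bool:
    """Matching as done by a crawler that treats every rule as a literal prefix."""
    return any(path.startswith(p) for p in patterns)


# A few rules copied from the proposed file.
RULES = ["/api/*", "/avatars", "/*/*/commit/*", "/*.git", "/*?lang="]

# Example paths are illustrative assumptions, not taken from crawler logs.
for path in ["/api/v1/version", "/avatars/abc", "/Drift/pvv-nixos-config/commit/bc7d598fed"]:
    print(path, blocked_with_wildcards(path, RULES), blocked_prefix_only(path, RULES))
```

In this sketch only `/avatars` is blocked by both matchers; the wildcard rules have no effect for a prefix-only crawler. If that is the complaint, rules such as `/api/*` could likely be written as the plain prefix `/api/`, while patterns like `/*/*/commit/*` have no prefix-only equivalent.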
Owner

No, I don't know, but this looks quite similar to https://codeberg.org/robots.txt, so it's probably fine.

Does the Crawl-delay: 2 (and sitemap, hah) only apply to the given AI bots? If so, mayhaps move it out/up to the global config for everyone.
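One way to check the scoping question is to feed a reduced version of the file to a parser. The sketch below uses Python's standard `urllib.robotparser` as a stand-in; this is an assumption, since real crawlers use their own parsers and may behave differently. The crawler name `SomeOtherBot` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Reduced version of the proposed file; Crawl-delay sits directly inside the AI group.
ATTACHED = """\
User-agent: *
Disallow: /avatars

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
Crawl-delay: 2

Sitemap: https://git.pvv.ntnu.no/sitemap.xml
"""

# Same content, but with a blank line before Crawl-delay, as in the proposal.
DETACHED = ATTACHED.replace("Disallow: /\nCrawl-delay", "Disallow: /\n\nCrawl-delay")


def parse(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


attached = parse(ATTACHED)
detached = parse(DETACHED)

print(attached.crawl_delay("GPTBot"))        # delay for a bot in the AI group
print(attached.crawl_delay("SomeOtherBot"))  # delay for a bot covered only by '*'
print(detached.crawl_delay("GPTBot"))        # delay when a blank line precedes Crawl-delay
print(attached.site_maps())                  # sitemaps are collected file-wide
```

With this parser the delay is reported only for agents in the group that contains the `Crawl-delay` line, the sitemap is reported regardless of group, and the blank line in front of `Crawl-delay` detaches it from the AI group so that no agent gets a delay. If the delay is meant for everyone, placing `Crawl-delay: 2` directly under the `User-agent: *` rules, with no blank line in between, would be the unambiguous form for this parser.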

Author
Owner

I don't think there is a global config? Do you mean the robots-txt module defaults? Or do we host a separate robots.txt directly on `pvv.ntnu.no`?

This pull request can be merged automatically.
This branch is out-of-date with the base branch

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin gitea-robots-txt:gitea-robots-txt
git checkout gitea-robots-txt
Reference: Drift/pvv-nixos-config#97