hpc20235/gitlinkhunter

Search GitHub links for broken links and hijack them

About

The tool enables broken link hijacking (also known as domain hijacking or domain takeover via dangling DNS records) by extracting links from Markdown files on GitHub.

This type of attack is well-known and widely practiced. An attacker targets a website—such as a forum with a significant amount of user-generated content—extracts links, identifies those that are broken due to domain expiration, and then registers these expired domains. The attacker subsequently hosts malicious payloads on the newly acquired domains. As a result, visitors who click on the previously legitimate links are exposed to the attacker's content.

The current tool focuses on GitHub repositories as a source of links. Many repositories are dedicated solely to maintaining lists of links, often serving as bookmark collections. Keeping such links up to date can be a challenging task, making them attractive targets for hijacking.

The tool contains three commands:

  1. queryinfo
  2. leech
  3. check

We leave it to the user to choose a query string that selects repositories suitable for leeching and checking. In this README, we use the query string "curated list", which matches repositories likely to serve our purpose.

Requirements

You need a GitHub token with access to the code search API (used to find Markdown files in repositories). Put it in a .env file.

Example of .env:

GITHUB_TOKEN=ghp_UkbXXXXXXI3NsBD6XXXXXLWqjXXXA3DXXXr

Usage

Total number of repos

Discover how many repos are returned by the query:

npm run queryinfo -- "curated list"

The command returns the total number of repositories found by the query. Knowing this number, you can choose the leeching parameters (pages and items per page) sensibly.
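
For reference, here is a minimal sketch of what queryinfo might do, assuming the GitHub REST code search endpoint, an extension:md qualifier, and the token loaded from .env via dotenv; all of these are assumptions, and the real implementation may differ:

import "dotenv/config"; // reads GITHUB_TOKEN from .env (assumption: dotenv is used)

// Hypothetical sketch: report how many results a query matches.
async function queryInfo(query: string): Promise<number> {
  const res = await fetch(
    "https://api.github.com/search/code?q=" +
      encodeURIComponent(query + " extension:md") +
      "&per_page=1",
    {
      headers: {
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
        Accept: "application/vnd.github+json",
      },
    }
  );
  const body = await res.json();
  return body.total_count; // the number queryinfo reports
}

queryInfo("curated list").then((n) => console.log(`total: ${n}`));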

Leech data

The following command will leech data starting from page 10 and ending at page 20 (exclusive). There are 100 items per result page, so in this example we leech (20 - 10) * 100 = 1000 items.

npm run leacher -- "curated list" --perpage 100 --start 10 --end 20 --out md1

To continue leeching, we run:

npm run leacher -- "curated list" --perpage 100 --start 20 --end 30 --out md2

This will retrieve the next 1000 results.
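
A rough sketch of the paging loop behind leacher, under the same assumptions as the queryinfo sketch above (REST code search, raw file download through the contents API media type, one directory per repository under --out); the real tool may differ:

import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical sketch: fetch result pages [start, end) and mirror each
// Markdown file under out/<owner>/<repo>/<path>.
async function leech(query: string, perPage: number, start: number, end: number, out: string) {
  const headers = {
    Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    Accept: "application/vnd.github+json",
  };
  for (let page = start; page < end; page++) { // end is exclusive
    const res = await fetch(
      "https://api.github.com/search/code?q=" +
        encodeURIComponent(query + " extension:md") +
        `&per_page=${perPage}&page=${page}`,
      { headers }
    );
    const { items = [] } = await res.json();
    for (const item of items) {
      // Asking the contents API for the raw media type returns the file body.
      const raw = await fetch(item.url, {
        headers: { ...headers, Accept: "application/vnd.github.raw+json" },
      });
      const dest = path.join(out, item.repository.full_name, item.path);
      await mkdir(path.dirname(dest), { recursive: true });
      await writeFile(dest, await raw.text());
    }
  }
}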

Link checker

Now we can start looking for broken links in the collected files.

npm run checker -- md1 | tee results.txt

This command checks the Markdown files in all subdirectories of the given output folder (md1 in this example). The results go to stdout.

How the link checker works

  1. GET the link
  2. On a non-200 status, whois the domain extracted from the link

For example, suppose we check an extracted link https://example.com/blog/article.html. We try to GET the link, allowing up to 10 redirects. If the returned status is non-200, we whois example.com, the registrable domain extracted from the link (the tool's output labels it the TLD).
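
A minimal sketch of that per-link check, assuming axios for the HTTP request, the system whois client, and a naive last-two-labels domain extraction; the tool's actual implementation may differ:

import axios from "axios";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Naive registrable-domain extraction: keep the last two labels, e.g.
// https://example.com/blog/article.html -> example.com.
// (Wrong for multi-label suffixes such as .co.uk.)
function registrableDomain(link: string): string {
  return new URL(link).hostname.split(".").slice(-2).join(".");
}

async function checkLink(link: string): Promise<void> {
  try {
    await axios.get(link, { maxRedirects: 10 }); // non-2xx status throws
    return; // link is alive
  } catch (err: any) {
    const reason = err.response
      ? `Request failed with status code ${err.response.status}`
      : err.message; // e.g. "getaddrinfo ENOTFOUND example.com"
    const domain = registrableDomain(link);
    console.log(`${link}|error: ${reason}`);
    console.log(` -> Checking TLD with whois: ${domain}`);
    // Shell out to the system whois client (assumed to be installed).
    const { stdout } = await execFileAsync("whois", [domain]);
    const hit = stdout
      .split("\n")
      .find((l) => /not found|available for registration|no match/i.test(l));
    // Only unregistered domains produce a "-> whois:" line.
    if (hit) console.log(` -> whois: ${hit.trim()}`);
  }
}

Because axios rejects any non-2xx response by default, both DNS failures (ENOTFOUND) and HTTP errors (403, 404) land in the same catch branch, mirroring the output shown below.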

Output interpretation

Working on OUT1/matiassingers/awesome-readme/readme.md
https://azmr.xyz|https://azmr.xyz|error: getaddrinfo ENOTFOUND azmr.xyz
 -> Checking TLD with whois: azmr.xyz
 -> whois: The queried object does not exist: DOMAIN NOT FOUND
https://sbot.lol|https://sbot.lol|error: getaddrinfo ENOTFOUND sbot.lol
 -> Checking TLD with whois: sbot.lol
 -> whois: The queried object does not exist: DOMAIN NOT FOUND
https://www.braceriabutcherrecchia.rest|https://www.braceriabutcherrecchia.rest|error: getaddrinfo ENOTFOUND www.braceriabutcherrecchia.rest
 -> Checking TLD with whois: braceriabutcherrecchia.rest
 -> whois: >>> Domain braceriabutcherrecchia.rest is available for registration
https://cstack.github.io/db_tutorial/|https://cstack.github.io|error: Request failed with status code 404
 -> Checking TLD with whois: cstack.github.io
 -> whois: Malformed request.
https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/|https://realpython.com|error: Request failed with status code 403
 -> Checking TLD with whois: realpython.com
https://realpython.com/blog/python/rethink-flask-a-simple-todo-list-powered-by-flask-and-rethinkdb/|https://realpython.com|error: Request failed with status code 403
 -> Checking TLD with whois: realpython.com
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/|https://machinelearningmastery.com|error: Request failed with status code 403
 -> Checking TLD with whois: machinelearningmastery.com
https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/|https://data-flair.training|error: Request failed with status code 403
 -> Checking TLD with whois: data-flair.training
https://jeffknupp.com/blog/2014/09/01/what-is-a-nosql-database-learn-by-writing-one-in-python/|https://jeffknupp.com|error: getaddrinfo ENOTFOUND jeffknupp.com

We can see that the first three domains are expired. We tried to GET them and got errors because no IP address is associated with the domains; we then ran whois on each domain and printed the response from the WHOIS server. For the first two domains we got "DOMAIN NOT FOUND", and for the third we got "available for registration". These WHOIS responses indicate that the domains are unregistered, so we can register them and use them for link hijacking.

The remaining links returned non-200 statuses, but whois showed that their domains are registered, so the script did not print a "-> whois:" line for them. These domains cannot be registered.