
Web crawler for company LinkedIn links and CSV management.




Simple LinkedIn Crawler using Playwright and BS4

Given a CSV file of company names, this script finds the LinkedIn URL for each company and stores the results in a CSV file. The script is then extended with a Playwright browser to find each company's employee count on LinkedIn.

Installation

Install the following libraries in your virtual environment.

To create a virtual env type the following:

python -m venv vtel
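
Then activate it; the command depends on your operating system:

source vtel/bin/activate        # Linux/macOS
vtel\Scripts\activate           # Windows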

Install the dependency libraries:

On the terminal, type:

pip install -r requirements.txt

Install playwright:

On the terminal, type:

playwright install

For reference, this program uses the following libraries:

  • playwright
  • pytest-playwright
  • pytest
  • beautifulsoup4
  • lxml
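
These map one-to-one onto the requirements.txt installed above; a minimal version (no version pins appear in the source, so none are shown) would be:

playwright
pytest-playwright
pytest
beautifulsoup4
lxml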

Run tests

Run the following command on your terminal:

pytest
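
The tests rely on the pytest-playwright plugin, which injects a ready-to-use page fixture into each test. A minimal hypothetical test (not one of the repository's actual tests) looks like this:

from playwright.sync_api import Page

def test_linkedin_homepage_loads(page: Page):
    # pytest-playwright provides the `page` fixture automatically
    page.goto("https://www.linkedin.com")
    assert "LinkedIn" in page.title()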

How to use it?

You need to add a valid username and password to log in successfully and access the feed and people URL paths. These can be added in the start.py file.
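
For example, near the top of start.py (the exact variable names are an assumption; check the actual file):

# Hypothetical placeholders; replace with your own LinkedIn account.
USERNAME = "you@example.com"
PASSWORD = "your-password"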

Once the credentials have been added, go to the root directory and run:

python start.py

The program performs the following steps (a simplified sketch follows the list):

  • Reads a companies.csv file where the companies are stored, with a column called keywords to improve accuracy.
  • With that list, a login to LinkedIn is attempted.
  • Using the authenticated session in the headless browser, a request is made for each company.
  • If the company is found, it is added to a new list.
  • All the company LinkedIn links are written to the linkedin_urls.csv file.
  • Using that file, a new request is made for each company to find the employees and associated members; once found, they are appended to a new list.
  • A new file called company_employees.csv is created with this information.
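
As a rough illustration of the flow above, here is a minimal sketch of the first half of the pipeline (reading companies.csv, logging in, and writing linkedin_urls.csv). The "name" column, the login selectors, and the search-result markup are assumptions for illustration, not the actual start.py:

import csv
from urllib.parse import quote

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

USERNAME = "you@example.com"  # hypothetical; see the credentials note above
PASSWORD = "your-password"

def main():
    # Step 1: read the input companies ("name" column is an assumption;
    # "keywords" comes from this README).
    with open("companies.csv", newline="") as f:
        companies = list(csv.DictReader(f))

    rows = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Step 2: attempt a LinkedIn login (selectors are assumptions).
        page.goto("https://www.linkedin.com/login")
        page.fill("#username", USERNAME)
        page.fill("#password", PASSWORD)
        page.click("button[type=submit]")

        # Steps 3-4: request a company search for each entry and keep the
        # first company link found on the results page.
        for company in companies:
            query = quote(f"{company['name']} {company['keywords']}")
            page.goto(
                "https://www.linkedin.com/search/results/companies/"
                f"?keywords={query}"
            )
            soup = BeautifulSoup(page.content(), "lxml")
            link = soup.select_one("a[href*='/company/']")  # assumed markup
            if link:
                rows.append({"name": company["name"], "url": link["href"]})

        browser.close()

    # Step 5: persist the discovered links.
    with open("linkedin_urls.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "url"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    main()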

Things to consider

  • Avoid making too many requests, or LinkedIn will present a challenge to prove you are human (most recently a voice challenge followed by a puzzle).
  • A proxy could be used, but it would increase the complexity of the project.
  • Use a .env or YAML file to store credentials instead of hardcoding them (a sketch follows this list).
  • This script uses basic web scraping techniques and might not work if LinkedIn changes its website structure. A dedicated web scraping framework such as Scrapy would be recommended for more robust and reliable scraping.
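
For the credentials point, a minimal sketch using python-dotenv (not among the listed dependencies, so it would need to be installed separately):

# Contents of a .env file, kept out of version control:
#   LINKEDIN_USERNAME=you@example.com
#   LINKEDIN_PASSWORD=your-password

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory
USERNAME = os.getenv("LINKEDIN_USERNAME")
PASSWORD = os.getenv("LINKEDIN_PASSWORD")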
