Simple Web Scraping Framework, based on Curl and str* functions.

laravieira/Scraping

Scrapping

Simple Web Scraping Framework, based on Curl and str* functions. Requires PHP ^8.0 (can easily be downgraded to PHP 7)

What can you do?

There are some functions to simplify your scripts. They are listed below:

Global functions

  • upname => Formats a string as a name (the first character of each word is uppercased) and returns it
  • price => Returns the float value of a price string such as 'xx$: 9.999,99'
  • accents => Replaces accented characters with their unaccented equivalents
  • strmstr => Returns the string starting at start3, which comes after start2, which comes after start1
  • strpart => Returns the substring between the start and end strings
  • strmpart => Returns the substring between start2 and end, where start2 comes after start1

Scrapping object

How to use?

upname

upname(string $text);

Pass the text to format as the parameter; the result is the formatted string. In the examples below, the comment after each call shows the output.

Examples:

echo upname('lara vieira');
// 'Lara Vieira'
echo upname('LARA VIEIRA');
// 'Lara Vieira'
echo upname('LEONARDO DE CÁPRIO');
// 'Leonardo de Cáprio'
echo upname('DON PEDRO II');
// 'Don Pedro II'
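
The examples above suggest three rules: ordinary words are title-cased, small particles such as "de" stay lowercase, and Roman numerals stay uppercase. The sketch below is one way to get those outputs. It is not the library's implementation: the name `upname_sketch`, the particle list, and the Roman numeral pattern (I to XXXIX) are all assumptions made for illustration, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper; NOT the library's upname(). Requires ext-mbstring.
function upname_sketch(string $text): string
{
    // Assumed list of particles that stay lowercase (illustrative only).
    $particles = ['de', 'da', 'do', 'das', 'dos', 'e'];
    // Assumed pattern: Roman numerals from i to xxxix.
    $roman = '/^(?=[ivx])x{0,3}(ix|iv|v?i{0,3})$/';

    $words = preg_split('/\s+/', trim(mb_strtolower($text, 'UTF-8')));
    foreach ($words as $i => $word) {
        if ($i > 0 && in_array($word, $particles, true)) {
            continue; // keep particles lowercase, except as the first word
        }
        if (preg_match($roman, $word)) {
            $words[$i] = strtoupper($word); // e.g. 'ii' becomes 'II'
        } else {
            $words[$i] = mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');
        }
    }
    return implode(' ', $words);
}
```

A heuristic like this misfires on short names that happen to look like Roman numerals (for example "Vi"), so a real implementation may need extra rules.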

price

price(string $text);

Pass the price text as the parameter; the result is its float value. Some examples are below; the comment after each call shows the output:

echo price('US$: 3.567,56');
// 3567.56
echo price('R$: 3.456.234,45');
// 3456234.45
echo price('Price is R$: 234,45');
// 234.45
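
The price examples above use "." as the thousands separator and "," as the decimal separator. As a rough illustration of that conversion, here is a minimal sketch. The name `price_sketch` and the choice to return null when no number is found are assumptions, and this is not the library's code.

```php
<?php
// Hypothetical helper; NOT the library's price().
function price_sketch(string $text): ?float
{
    // First run of digits and dots, optionally followed by ',decimals'.
    if (!preg_match('/\d[\d.]*(?:,\d+)?/', $text, $match)) {
        return null; // assumption: no number found yields null
    }
    $number = str_replace('.', '', $match[0]); // drop thousands separators
    $number = str_replace(',', '.', $number);  // ',' becomes the decimal point
    return (float) $number;
}

echo price_sketch('R$: 3.456.234,45'), PHP_EOL;
// 3456234.45
```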

accents

accents(string $text);

Pass the text to format as the parameter; the result is the formatted string. An example is below; the comment shows the output:

echo accents('Aglomeração, Apóstolo, vô, vó');
// 'Aglomeracao, Apostolo, vo, vo'
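
One simple way to get this kind of replacement is a lookup table applied with `strtr`. The sketch below assumes UTF-8 input with precomposed characters and covers only a small, hand-picked set of letters. The name `accents_sketch` and the map contents are illustrative, not taken from the library.

```php
<?php
// Hypothetical helper; NOT the library's accents().
function accents_sketch(string $text): string
{
    // Assumed, incomplete map of accented letters to plain ASCII letters.
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'í' => 'i', 'ì' => 'i', 'î' => 'i', 'ï' => 'i',
        'ó' => 'o', 'ò' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'ú' => 'u', 'ù' => 'u', 'û' => 'u', 'ü' => 'u',
        'ç' => 'c', 'ñ' => 'n',
        'Á' => 'A', 'À' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A',
        'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Í' => 'I', 'Ì' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'Ó' => 'O', 'Ò' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
        'Ú' => 'U', 'Ù' => 'U', 'Û' => 'U', 'Ü' => 'U',
        'Ç' => 'C', 'Ñ' => 'N',
    ];
    // With an array, strtr replaces each key wherever it occurs.
    return strtr($text, $map);
}
```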

strmstr

strmstr(
    string $haystack, 
    string $start1, 
    string $start2, 
    string|null $start3=null
);

If start3 is passed, this function returns the part of haystack that begins at start3, where start3 comes after start2, which comes after start1. Otherwise it returns the part of haystack that begins at start2, where start2 comes after start1. As with strstr, the returned string includes the last start passed.

This function is something like a stack of strstr calls:

strstr(strstr(strstr(haystack, start1), start2), start3)

Some examples are below; the comment after each call shows the output:

echo strmstr('ABC ABC ABC', 'C');
// 'C ABC ABC'
echo strmstr('ABC ABC ABC', 'B', 'A');
// 'ABC ABC'
echo strmstr('ABC ABC ABC', 'B', 'B', 'A');
// 'ABC'
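
A literal nesting of `strstr` calls would find the second 'B' in the last example at the same position as the first. The sketch below therefore assumes that each start is searched for after the end of the previous match, which reproduces the three outputs listed above. The name `strmstr_sketch`, the variadic signature, and the `false` return on a missing start are assumptions for illustration, not the library's API.

```php
<?php
// Hypothetical helper; NOT the library's strmstr().
function strmstr_sketch(string $haystack, string ...$starts): string|false
{
    $offset = 0;   // where the next search begins
    $pos = false;  // position of the most recent match
    foreach ($starts as $start) {
        $pos = strpos($haystack, $start, $offset);
        if ($pos === false) {
            return false; // assumption: mirror strstr() when nothing matches
        }
        $offset = $pos + strlen($start); // continue after this match
    }
    // Like strstr(), the result includes the last start that was matched.
    return $pos === false ? false : substr($haystack, $pos);
}

echo strmstr_sketch('ABC ABC ABC', 'B', 'B', 'A'), PHP_EOL;
// 'ABC'
```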

strpart

* This is my favorite one for web scraping.

strpart(
    string $haystack,
    string|null $start = null,
    string|null $end = null,
    bool $keep_start = false
);

This function returns the part of haystack between the first occurrence of start and the first occurrence of end after start.

  • If start is null, it returns everything in haystack before the first occurrence of end.

  • If end is null, it returns everything in haystack after the first occurrence of start.

  • If keep_start is set to true (the default is false), the function returns the same result, but with start included at the beginning.

Some examples are below; the comment after each call shows the output:

echo strpart('ABC ABC ABC', ' ', ' ');
// 'ABC'
echo strpart('ABC ABC ABC', ' ');
// 'ABC ABC'
echo strpart('ABC ABC ABC', end:' ');
// 'ABC'
echo strpart('ABC ABC ABC', ' ', ' ', true);
// ' ABC'
echo strpart('<h2>Subtitle</h2>', '>', '<');
// 'Subtitle'
echo strpart('<div><div>Content</div></div>', '<div>', '</div>');
// '<div>Content'
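
The rules above can be sketched with `strpos` and `substr`. This is a minimal illustration that reproduces the listed outputs, not the library's code. The name `strpart_sketch` is hypothetical, and returning an empty string when start or end is not found is an assumption, since the text does not say what happens in that case.

```php
<?php
// Hypothetical helper; NOT the library's strpart().
function strpart_sketch(
    string $haystack,
    ?string $start = null,
    ?string $end = null,
    bool $keep_start = false
): string {
    $from = 0;       // where the returned part begins
    $searchFrom = 0; // where the search for $end begins
    if ($start !== null) {
        $pos = strpos($haystack, $start);
        if ($pos === false) {
            return ''; // assumption: missing start yields an empty string
        }
        $searchFrom = $pos + strlen($start);
        $from = $keep_start ? $pos : $searchFrom;
    }
    if ($end === null) {
        return substr($haystack, $from); // everything after start
    }
    $stop = strpos($haystack, $end, $searchFrom);
    if ($stop === false) {
        return ''; // assumption: missing end yields an empty string
    }
    return substr($haystack, $from, $stop - $from);
}

echo strpart_sketch('ABC ABC ABC', ' ', ' ', true), PHP_EOL;
// ' ABC'
```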

strmpart

strmpart(
    string $haystack,
    string $start1,
    string $start2,
    string|null $end = null,
    bool $keep_start = false
);

This function handles the case in the last strpart example, where the wanted text sits after a second start marker.

This function returns the part of haystack between the first occurrence of start2 that comes after the first occurrence of start1, and the first occurrence of end after start2.

  • If end is null, it returns everything in haystack after the first occurrence of start2 that follows the first occurrence of start1.

  • If keep_start is set to true (the default is false), the function returns the same result, but with start2 included at the beginning.

Some examples are below; the comment after each call shows the output:

echo strmpart('<div><div>Content</div></div>', '<div>', '<div>', '</div>');
// 'Content'
echo strmpart('<div><div>Content</div></div>', '>', '>', '<');
// 'Content'
echo strmpart('<a id="link1"><h2 id="text1">Content</h2></a>', '<h2', 'id="', '"');
// 'text1'
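
As with the other functions, the behaviour described above can be approximated with `strpos` and `substr`. The sketch below reproduces the three outputs listed, under two assumptions that the text does not confirm: start2 is searched for after the end of start1, and a missing marker yields an empty string. The name `strmpart_sketch` is hypothetical and this is not the library's implementation.

```php
<?php
// Hypothetical helper; NOT the library's strmpart().
function strmpart_sketch(
    string $haystack,
    string $start1,
    string $start2,
    ?string $end = null,
    bool $keep_start = false
): string {
    $p1 = strpos($haystack, $start1);
    if ($p1 === false) {
        return ''; // assumption: missing start1 yields an empty string
    }
    // Look for start2 only after the end of start1.
    $p2 = strpos($haystack, $start2, $p1 + strlen($start1));
    if ($p2 === false) {
        return '';
    }
    $after = $p2 + strlen($start2);
    $from = $keep_start ? $p2 : $after;
    if ($end === null) {
        return substr($haystack, $from);
    }
    $stop = strpos($haystack, $end, $after);
    if ($stop === false) {
        return '';
    }
    return substr($haystack, $from, $stop - $from);
}

echo strmpart_sketch('<a id="link1"><h2 id="text1">Content</h2></a>', '<h2', 'id="', '"'), PHP_EOL;
// 'text1'
```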
