Unsolved XPath Problem #533
-
Once again, I have spent hours trying to find the correct XPath "code" to scrape the value of a DD based on the name of its DT. nth-child() would work, but it is unreliable because the DL on the web page is populated dynamically. My last try was with the "code"
I have attached a file that contains the HTML of the section in which the value can be found. Thank you so much for your help. XPath is really very difficult to learn.
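For readers with the same question: one common way to pick a DD by the name of its DT, without relying on position, is to match the DT text and then step to the following sibling. The sketch below is only an illustration. The helper names `ddXPathForLabel` and `readDdText` are hypothetical, it assumes the DT text equals the label exactly once whitespace is normalized, and it has not been checked against the LinkedIn page or inside Automa.

```javascript
// Hypothetical helper: build an XPath that selects the first <dd> following
// the <dt> whose whitespace-normalized text equals `label`.
function ddXPathForLabel(label) {
  // XPath 1.0 has no escape for quotes inside a string literal,
  // so this sketch simply rejects such labels.
  if (label.includes('"')) {
    throw new Error('labels containing double quotes are not supported in this sketch');
  }
  return `//dt[normalize-space()="${label}"]/following-sibling::dd[1]`;
}

// Hypothetical helper for use in a browser: returns the trimmed text of the
// matching <dd>, or null when the page has no <dt> with that label.
function readDdText(label, doc) {
  const result = doc.evaluate(
    ddXPathForLabel(label),
    doc,
    null,
    XPathResult.FIRST_ORDERED_NODE_TYPE,
    null
  );
  const node = result.singleNodeValue;
  return node ? node.textContent.trim() : null;
}

console.log(ddXPathForLabel('Website'));
```

In a browser console, `readDdText('Website', document)` would be the way to try it on a loaded page. Because the expression keys on the label text, it does not matter which position the DT has in the list.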
Replies: 9 comments 2 replies
-
Sorry, I forgot to upload the *.txt file - here it is
-
Have you tried replacing the */dt at the beginning with //dt?
-
Hello Ahmad, I tried that too. Result:
[image: image.png]
-
Ahmad, thank you very much.
Example website: https://www.linkedin.com/company/charles-schwab/about/
The section "About"
Best, Anton
On Mon, 6 June 2022 at 06:30, Ahmad Kholid wrote:
… If the data is dynamic, it would be much easier to use JavaScript. Can you tell me which section of that website you want to extract and the name of the column the data will be inserted into, so I can help you write the JS code?
-
Hello Ahmad,
I kept trying and wanted to check whether my XPath selector works on the HTML, so I copied the HTML into jsoup test and used the XPath selector
***@***.***="mb6"]/div/div/section/dl/dt[1]/following-sibling::dd[1]
In jsoup it worked just fine, and in Automa, too!
Nevertheless, the problem is that dt[] can contain any of several headings, depending on what the user filled out.
I tried to work with the JavaScript code you sent me for the BizBuySell website, but it did not work out.
const element = document.querySelector('.odd:nth-child(2) > b');
// Expected text: 'Website', 'Phone', 'Industry', 'Company size', 'Headquarters', 'Founded', or 'Specialties'
const elementText = element.innerText;
const columnsName = {
  'Website': 'Website',
  'Phone': 'Phone',
  'Industry': 'Industry',
  'Company size': 'Company size',
  'Headquarters': 'Headquarters',
  'Founded': 'Founded',
  'Specialties': 'Specialties'
};
const column = columnsName[elementText];
automaNextBlock({ [column]: elementText });
I thought one could create seven JavaScript blocks, one for each of the columns, and swap in nth-child(1), nth-child(2), nth-child(3), nth-child(4), nth-child(5), nth-child(6) and nth-child(7).
I am sure that you have a solution for this tricky problem.
Best, Anton
-
Sorry for the late reply. You can try this JavaScript code:
// Map each <dt> label on the page to a column name in the workflow table.
const columns = [
  { text: 'Website', name: 'website' },
  { text: 'Phone', name: 'phone' },
  { text: 'Industry', name: 'industry' },
  { text: 'Company size', name: 'company_size' },
  { text: 'Headquarters', name: 'headquarters' },
  { text: 'Founded', name: 'founded' },
  { text: 'Specialties', name: 'specialties' }
];
const result = [];
const elements = document.querySelectorAll('dl dt');
elements.forEach((element) => {
  // Build a case-insensitive pattern from the <dt> text and look for a matching column.
  const regex = new RegExp(element.innerText, 'i');
  const findColumn = columns.find((column) => regex.test(column.text));
  if (!findColumn) return;
  // The value sits in the <dd> that directly follows the <dt>.
  const columnValue = element.nextElementSibling.innerText;
  result.push({ [findColumn.name]: columnValue });
});
automaNextBlock(result);
But before using the JavaScript block, you need to use the Element exists block first, with dl as the selector, to check whether the website has loaded.
[image: image]
<https://user-images.githubusercontent.com/22908993/172495307-337cdce9-7dc0-41df-a7f7-e5f19d9adeba.png>
And in the workflow table
[image: image]
<https://user-images.githubusercontent.com/22908993/172495453-bdcd1fcf-801b-4085-9237-24b397975af9.png>
-
Ahmad, thank you very much!
It works like clockwork!
-
Hello again! Your code works just fine. Nevertheless, I am having trouble storing the results in an MS Excel sheet. The code outputs only the columns that contain a value; that can be 1 column or up to all 7 columns.
Output Option 1
Website | Phone | Industry | Company_size | Headquarters | Founded | Specialties
If the code returns a value, it is pushed to the correct column; if nothing is returned, the column value is n/a.
Output Option 2
Both options return a flat table structure that can be imported into an Excel sheet without problems or additional work.
P.S.: In addition to your code, I scrape 8 other values using 8 different GetText blocks.
I looked for a solution on various JavaScript reference websites, but without success. Could you please help me with this issue?
-
Thanks a million! It works great!
On Fri, 10 June 2022 at 02:49, Ahmad Kholid wrote:
… Try this code
const columns = [
  { text: 'Website', name: 'website' },
  { text: 'Phone', name: 'phone' },
  { text: 'Industry', name: 'industry' },
  { text: 'Company size', name: 'company_size' },
  { text: 'Headquarters', name: 'headquarters' },
  { text: 'Founded', name: 'founded' },
  { text: 'Specialties', name: 'specialties' }
];
const row = {};
const elements = document.querySelectorAll('dl dt');
const texts = Array.from(elements).map((el) => el.innerText);
columns.forEach((column) => {
  const regex = new RegExp(column.text, 'i');
  const textIndex = texts.findIndex((text) => regex.test(text));
  // Use 'n/a' when the page has no <dt> for this column.
  const columnValue = textIndex === -1
    ? 'n/a'
    : elements[textIndex].nextElementSibling.innerText;
  row[column.name] = columnValue;
});
automaNextBlock(row);
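For anyone who wants to try the "one row, n/a for missing columns" idea outside the browser, here is a rough sketch with the DOM access separated from the row-building step. `buildRow`, the `pairs` shape, and the sample values are made up for illustration. Note one deliberate difference: it compares labels exactly (trimmed, case-insensitive) instead of using the substring RegExp match from the code above, so it is stricter.

```javascript
// Column definitions copied from the reply above.
const columns = [
  { text: 'Website', name: 'website' },
  { text: 'Phone', name: 'phone' },
  { text: 'Industry', name: 'industry' },
  { text: 'Company size', name: 'company_size' },
  { text: 'Headquarters', name: 'headquarters' },
  { text: 'Founded', name: 'founded' },
  { text: 'Specialties', name: 'specialties' }
];

// Hypothetical helper: `pairs` is an array of { term, value } objects,
// one per dt/dd pair found on the page.
function buildRow(pairs, columnDefs, missing = 'n/a') {
  // Index the scraped values by normalized label for exact lookups.
  const byLabel = new Map(
    pairs.map(({ term, value }) => [term.trim().toLowerCase(), value.trim()])
  );
  const row = {};
  for (const { text, name } of columnDefs) {
    const key = text.toLowerCase();
    // Every column gets a key, so each row has the same shape.
    row[name] = byLabel.has(key) ? byLabel.get(key) : missing;
  }
  return row;
}

// In the browser, the pairs could be collected like this (not run here):
// const pairs = Array.from(document.querySelectorAll('dl dt'), (dt) => ({
//   term: dt.innerText,
//   value: dt.nextElementSibling ? dt.nextElementSibling.innerText : ''
// }));
// automaNextBlock(buildRow(pairs, columns));

// Made-up sample data standing in for a page that lists only two fields.
const sample = [
  { term: 'Website', value: 'https://example.com' },
  { term: ' Industry ', value: 'Financial Services' }
];
const row = buildRow(sample, columns);
console.log(row);
```

Since every row carries all seven keys in the same order, the rows line up as a flat table, which is the shape asked for in Output Option 1.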