Remove some strings #29

Open
NBibikov opened this issue Jul 11, 2015 · 4 comments

Comments

@NBibikov

Hi! Please help me. I read the docs but don't understand how to remove some strings. I have some HTML strings in which one part varies (aHirg7S8Zu0):

<p><img src="//img.youtube.com/vi/aHirg7S8Zu0/0.jpg" height="505" width="640"></p>
<p>&nbsp;</p>
<h2 style="text-align: center;">Dear parents, I want say you...</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sit sane ista voluptas. Aliter autem vobis placet. Fortemne possumus dicere eundem illum Torquatum? Duo Reges: constructio interrete. Igitur neque stultorum quisquam beatus neque sapientium non beatus.<br>
<p>&nbsp;</p>

How can I delete the first line and every &nbsp; paragraph (like the second line)?

1. <p><img src="//img.youtube.com/vi/aHirg7S8Zu0/0.jpg" height="505" width="640"></p> 
2. <p>&nbsp;</p>

Thank you very much

@nolanw
Owner

nolanw commented Jul 13, 2015

There are fundamentally two ways to go about it: focus on the content to keep; or discard unwanted content. I'm not sure which one makes more sense in the context you've given, so I'll describe both.

If you choose to focus on the content to keep, it looks like you're interested in the header and the paragraph thereafter. So you could do something like:

HTMLDocument *document = /* load a document */;
HTMLElement *h2 = [document firstNodeMatchingSelector:@"h2"];
HTMLElement *relevantParagraph = [document firstNodeMatchingSelector:@"h2 + p"];
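
To actually use those two elements, something along these lines might work. It's only a sketch: `LogHeadingAndBody` and `htmlString` are made-up names for illustration, and the import line depends on how you've added HTMLReader to your project. The `nil` check is there because either selector comes back empty if the document isn't shaped like your sample.

```objective-c
#import <Foundation/Foundation.h>
// Adjust this import to match how HTMLReader is integrated in your project.
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: logs the text of the first <h2> and of the <p> right after it.
static void LogHeadingAndBody(NSString *htmlString) {
  HTMLDocument *document = [HTMLDocument documentWithString:htmlString];
  HTMLElement *h2 = [document firstNodeMatchingSelector:@"h2"];
  HTMLElement *relevantParagraph = [document firstNodeMatchingSelector:@"h2 + p"];

  // Both lookups return nil when nothing matches, so check before using the text.
  if (h2 != nil && relevantParagraph != nil) {
    NSLog(@"Heading: %@", h2.textContent);
    NSLog(@"Body: %@", relevantParagraph.textContent);
  }
}
```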

If you choose to discard unwanted content, you might do something like:

HTMLDocument *document = /* load a document */;
HTMLElement *img = [document firstNodeMatchingSelector:@"p > img"];
HTMLElement *imageParagraph = img.parentElement;
// Grab the parent of all these paragraphs for later.
HTMLElement *parent = imageParagraph.parentElement;
[imageParagraph removeFromParentNode];
// childElementNodes skips text and comment nodes (which have no tagName)
// and returns a separate array, so removing children mid-loop is safe.
for (HTMLElement *child in parent.childElementNodes) {
  // U+00A0 is non-breaking space, aka &nbsp;
  if ([child.tagName isEqualToString:@"p"] &&
      [child.textContent isEqualToString:@"\u00a0"])
  {
    [child removeFromParentNode];
  }
}

These examples lean pretty heavily on assuming your document looks exactly like the context you've provided here, so you might need to make it a bit more general.
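
As one possible way to generalize the discard approach, here's a rough sketch that walks every paragraph in the document instead of relying on the first image. `RemoveThumbnailsAndBlankParagraphs` is a hypothetical helper (not part of the library), and the `img.youtube.com/vi/` check is an assumption based on the sample markup in the question, so adjust it to whatever your real documents contain.

```objective-c
#import <Foundation/Foundation.h>
// Adjust this import to match how HTMLReader is integrated in your project.
#import <HTMLReader/HTMLReader.h>

// Hypothetical helper: removes paragraphs that have no visible text and either
// contain no image at all or contain a YouTube thumbnail image.
static void RemoveThumbnailsAndBlankParagraphs(HTMLDocument *document) {
  // Whitespace, newlines, and U+00A0 (non-breaking space, aka &nbsp;).
  NSMutableCharacterSet *blank = [NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
  [blank addCharactersInString:@"\u00a0"];

  // nodesMatchingSelector: returns an array, so removing nodes while looping
  // does not mutate the collection being enumerated.
  for (HTMLElement *paragraph in [document nodesMatchingSelector:@"p"]) {
    NSString *trimmed = [paragraph.textContent stringByTrimmingCharactersInSet:blank];
    if (trimmed.length > 0) {
      continue; // The paragraph has real text, so keep it.
    }

    HTMLElement *img = [paragraph firstNodeMatchingSelector:@"img"];
    NSString *src = img.attributes[@"src"];
    BOOL isThumbnail = src != nil &&
        [src rangeOfString:@"img.youtube.com/vi/"].location != NSNotFound;

    // Drop empty paragraphs and thumbnail-only paragraphs; keep other images.
    if (img == nil || isThumbnail) {
      [paragraph removeFromParentNode];
    }
  }
}
```

Matching on a substring of `src` rather than the full URL means the varying video ID (aHirg7S8Zu0 in the sample) doesn't matter.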

Does that make sense?

@NBibikov
Author

Thank you very much! I will experiment with both options.

@nolanw
Owner

nolanw commented Sep 20, 2015

@NBibikov did you ever solve this?

@sujeet14108

Hi,
Actually, there is no problem in your code.
The "&nbsp;" will not be visible on the rendered page, although you can simply delete it. Some editors insert it when the markup is created.

Also, the image source URL will not be printed as such.

