Return list of expanded URLs instead of just the string of only the first URL #14

Open
hauselin opened this issue Apr 15, 2024 · 2 comments

Comments

@hauselin

Hi, I noticed in the line below the code only returns the first expanded URL (urls[0]), and returns it as a string. Often, there are multiple URLs, so it would be great if all expanded URLs in any given tweet were returned. See the suggestion below. Thanks!

expanded_url = urls[0].get('expanded_url', None) if urls else None

# current: returns string and returns only the first expanded URL
expanded_url = urls[0].get("expanded_url", None) if urls else None

# suggestion: returns list of all expanded URLs
expanded_urls = [url.get("expanded_url") for url in urls if url.get("expanded_url")]
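
A minimal sketch of how the suggestion could be wrapped so callers always get a list back, even when `urls` is missing. The function name `collect_expanded_urls` and the sample data are hypothetical and not part of the code base; only the `expanded_url` key comes from the line quoted above.

```python
from typing import Any, Dict, List, Optional


def collect_expanded_urls(urls: Optional[List[Dict[str, Any]]]) -> List[str]:
    """Return every non-empty expanded_url in order; an empty list if there are none.

    Hypothetical helper: `urls` is assumed to be the list of URL dicts
    attached to a single tweet (or None when the tweet has no URLs).
    """
    return [u["expanded_url"] for u in (urls or []) if u.get("expanded_url")]


# Made-up sample data for illustration only.
sample_urls = [
    {"url": "https://t.co/abc", "expanded_url": "https://example.com/a"},
    {"url": "https://t.co/def"},  # entry without an expanded_url is skipped
    {"url": "https://t.co/ghi", "expanded_url": "https://example.org/b"},
]

print(collect_expanded_urls(sample_urls))
print(collect_expanded_urls(None))
```

Returning an empty list instead of None keeps the return type consistent, at the cost of changing the field from a string to a list for downstream consumers.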
@raindrift
Collaborator

Agreed, this is the right thing. In the current version of the api spec, the urls are a list. The sample data generator doesn't insert multiple urls yet though, so I'll leave this issue open until it does.

@hauselin
Author

@raindrift, I also noticed two additional/related issues in integration_test.py (and maybe also in other places in the code base?). Right now, when running tests on Twitter, embedded_urls contains only shortened URLs and misses all the URLs in expanded_url in the raw data.

  1. The current re.findall solution will miss URLs that don't start with "https" or "http". I believe sometimes, on certain platforms, URLs might omit the http?
text = "https://www.facebook.com www.google.com github.com"  # 3 URLS 

# current
>>> re.findall(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", text)
['https://www.facebook.com']  # misses 2 of the 3 URLs

# more robust solution: https://urlextract.readthedocs.io/en/latest/urlextract.html
from urlextract import URLExtract
extractor = URLExtract()
>>> extractor.find_urls(text)
['https://www.facebook.com', 'www.google.com', 'github.com']  # gets all 3 URLS
  2. For Twitter, only shortened URLs are returned; the expanded URLs in the expanded_url field aren't included. For example, it seems like the current implementation in data_pull.py will always get None from row.get("expanded_url", None), because expanded_url sits inside entities, in a list of dictionaries. They have to be extracted/unpacked properly (see example code).

# Grab relevant fields
for _, row in sample.iterrows():
    # current: stays empty, because expanded_url is not a top-level field
    embedded_urls = []
    if row.get("expanded_url", None):
        embedded_urls.append(row["expanded_url"])

    # suggestion: unpack expanded_url from each dict in entities["urls"]
    embedded_urls = []
    urls = row.entities.get("urls", [])
    for url in urls:
        expanded_url = url.get("expanded_url")
        if expanded_url:
            embedded_urls.append(expanded_url)
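
The two points above can be sketched together with the standard library only. This is an illustration under stated assumptions, not the project's implementation: the helper names, the regex, and the sample row are hypothetical, and the row is assumed to expose `.get("entities")` the way a dict or a pandas Series does. The regex is a permissive stand-in for a dedicated extractor such as urlextract and will produce false positives on things like file names.

```python
import re
from typing import Any, List, Mapping

# Assumption: a loose pattern that also accepts scheme-less hosts such as
# "www.google.com" or "github.com". It can over-match (e.g. "notes.txt").
LOOSE_URL = re.compile(r"(?:https?://)?(?:[\w-]+\.)+[a-zA-Z]{2,}(?:/\S*)?")


def expanded_urls_from_row(row: Mapping[str, Any]) -> List[str]:
    """Unpack expanded_url values nested under entities["urls"]."""
    entities = row.get("entities")
    if not isinstance(entities, dict):  # missing, None, or NaN
        return []
    return [
        u["expanded_url"]
        for u in (entities.get("urls") or [])
        if u.get("expanded_url")
    ]


def urls_from_text(text: str) -> List[str]:
    """Find URLs in free text, with or without an http(s) scheme."""
    return LOOSE_URL.findall(text or "")


# Made-up row for illustration only.
row = {
    "entities": {
        "urls": [
            {"url": "https://t.co/abc", "expanded_url": "https://example.com/a"},
            {"url": "https://t.co/def"},
        ]
    }
}

print(expanded_urls_from_row(row))
print(urls_from_text("https://www.facebook.com www.google.com github.com"))
```

Preferring the structured `entities` field over text matching avoids guessing where it is available; the regex is only useful as a fallback for platforms that do not provide expanded URLs.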
