Return list of expanded URLs instead of just the string of only the first URL #14

hauselin · 2024-04-15T06:05:03Z

Hi, I noticed that the line below returns only the first expanded URL (urls[0]), and returns it as a string. Tweets often contain multiple URLs, so it would be great if all expanded URLs in a given tweet were returned. See the suggestion below. Thanks!

ranking-challenge/sample_data/preprocessing.py

Line 186 in 28b900e

expanded_url = urls[0].get('expanded_url', None) if urls else None

# current: returns string and returns only the first expanded URL
expanded_url = urls[0].get("expanded_url", None) if urls else None

# suggestion: returns list of all expanded URLs
expanded_urls = [url.get("expanded_url") for url in urls if url.get("expanded_url")]
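To make the suggestion easier to try in isolation, here is a minimal sketch that wraps the same list comprehension in a helper. The function name `extract_expanded_urls` and the sample entries are hypothetical and not taken from preprocessing.py; the sketch only assumes that `urls` is a list of dictionaries (or empty/None), as in the line quoted above.

```python
from typing import Any, Dict, List, Optional


def extract_expanded_urls(urls: Optional[List[Dict[str, Any]]]) -> List[str]:
    """Return every non-empty expanded_url from a list of URL entities.

    Hypothetical helper for illustration; returns an empty list when
    `urls` is None or empty, instead of None.
    """
    if not urls:
        return []
    return [u["expanded_url"] for u in urls if u.get("expanded_url")]


# Made-up sample entries; the middle one has no expanded_url key.
sample_urls = [
    {"url": "https://t.co/abc", "expanded_url": "https://example.com/a"},
    {"url": "https://t.co/def"},
    {"url": "https://t.co/ghi", "expanded_url": "https://example.com/b"},
]
print(extract_expanded_urls(sample_urls))
```

Returning an empty list rather than None keeps the return type uniform, so callers can iterate over the result without a separate None check.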


raindrift · 2024-04-16T22:38:15Z

Agreed, this is the right thing. In the current version of the API spec, the URLs are a list. The sample data generator doesn't insert multiple URLs yet, though, so I'll leave this issue open until it does.

hauselin · 2024-04-17T18:29:06Z

@raindrift, I also noticed two related issues in integration_test.py (and possibly elsewhere in the code base). Right now, when running tests on Twitter data, embedded_urls contains only shortened URLs and misses all the URLs in the expanded_url field of the raw data.

The current re.findall pattern will miss URLs that don't start with "http://" or "https://". I believe that on some platforms URLs may omit the scheme.

import re

from urlextract import URLExtract  # third-party package

text = "https://www.facebook.com www.google.com github.com"  # 3 URLs

# current: requires an explicit http:// or https:// scheme
print(re.findall(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", text))
# ['https://www.facebook.com']  (misses 2 of the 3 URLs)

# more robust solution: https://urlextract.readthedocs.io/en/latest/urlextract.html
extractor = URLExtract()
print(extractor.find_urls(text))
# ['https://www.facebook.com', 'www.google.com', 'github.com']  (gets all 3 URLs)
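If adding a dependency is not an option, a looser standard-library pattern is one possible fallback. The sketch below is only an illustration: the pattern, the name `LOOSE_URL_RE`, and the function `find_urls_loose` are hypothetical, and this is neither the pattern in integration_test.py nor what urlextract does internally.

```python
import re

# Hypothetical pattern for illustration: optional http(s) scheme, one or more
# dotted host labels, an alphabetic TLD of 2+ letters, and an optional path.
LOOSE_URL_RE = re.compile(r"(?:https?://)?(?:[\w-]+\.)+[a-zA-Z]{2,}(?:/\S*)?")


def find_urls_loose(text: str) -> list:
    """Return URL-like substrings, with or without an http(s) scheme."""
    # The pattern has no capturing groups, so findall returns full matches.
    return LOOSE_URL_RE.findall(text)


text = "https://www.facebook.com www.google.com github.com"
print(find_urls_loose(text))
# ['https://www.facebook.com', 'www.google.com', 'github.com']
```

The trade-off is false positives: a pattern this loose also matches things like file names ("report.txt") and the domain part of email addresses, which is part of why a dedicated extractor that checks against known TLDs may be the safer choice.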

For Twitter, only shortened URLs are returned; the expanded URLs in the expanded_url field aren't included. For example, it seems that the current implementation of data_pull.py will always get None from row.get("expanded_url", None), because expanded_url is nested inside entities, in a list of dictionaries. The URLs have to be extracted/unpacked properly (see the example code below).

ranking-challenge/sample_data/data_pull.py

Lines 249 to 253 in ae88eb0

    
           # Grab relevant fields 
        
           for _, row in sample.iterrows(): 
        
               embedded_urls = [] 
        
               if row.get("expanded_url", None): 
        
                   embedded_urls.append(row["expanded_url"])

# suggestion: unpack expanded_url from each entry in entities["urls"]
embedded_urls = []
urls = row.entities.get("urls", [])
for url in urls:
    expanded_url = url.get("expanded_url")
    if expanded_url:
        embedded_urls.append(expanded_url)
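One thing the loop above does not cover is a row whose entities value is missing or is not a dictionary (for example NaN in a DataFrame), in which case row.entities.get would raise. The following is a minimal defensive sketch under that assumption; the helper name `embedded_urls_from_row` and the sample rows are hypothetical. It relies only on `row.get`, which is available on both plain dicts and pandas Series.

```python
def embedded_urls_from_row(row) -> list:
    """Collect expanded URLs nested under row["entities"]["urls"].

    Hypothetical helper for illustration; `row` may be a dict or a
    pandas Series, since both provide .get().
    """
    entities = row.get("entities")
    if not isinstance(entities, dict):
        # Missing key, None, or NaN: nothing to unpack.
        return []
    urls = entities.get("urls") or []
    return [
        u["expanded_url"]
        for u in urls
        if isinstance(u, dict) and u.get("expanded_url")
    ]


# Made-up row shaped like the nested structure described above.
row = {
    "entities": {
        "urls": [
            {"url": "https://t.co/x", "expanded_url": "https://example.com/x"},
            {"url": "https://t.co/y", "expanded_url": "https://example.com/y"},
        ]
    }
}
print(row.get("expanded_url"))      # None: the key is not at the top level
print(embedded_urls_from_row(row))  # both expanded URLs
```

The first print illustrates the problem described above: a top-level lookup finds nothing, while the nested unpacking returns every expanded URL.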
