Add support for full_text in tweets; resolve #192. #252

Merged on Aug 10, 2018 (2 commits)

@ruebot (Member) commented Aug 10, 2018

GitHub issue(s):
#192

What does this Pull Request do?

Adds support in the Tweet utility (TweetUtils) for reading full_text from tweets, exposed as tweet.fullText.
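
For readers who have not used json4s, the idea can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper name fullTextOf and the two sample records are made up, and it only looks at a top-level full_text field. It assumes json4s is on the classpath, as it is inside the Spark shell.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical helper (not the merged TweetUtils code): read a tweet's
// top-level full_text field, returning None when the field is absent
// or is not a JSON string.
def fullTextOf(tweet: JValue): Option[String] =
  (tweet \ "full_text") match {
    case JString(s) => Some(s)
    case _          => None
  }

// Made-up sample records, for illustration only.
val extended = parse("""{"id_str": "1", "full_text": "an extended-mode tweet"}""")
val classic  = parse("""{"id_str": "2", "text": "a classic tweet"}""")

println(fullTextOf(extended)) // Some(an extended-mode tweet)
println(fullTextOf(classic))  // None
```

Returning an Option here is a choice made for the sketch; the accessor in the test session below returns a plain String.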

How should this be tested?

I tested this in the Spark shell with Apache Spark 2.3.1:

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.TweetUtils._

// Load tweets from HDFS
val tweets = RecordLoader.loadTweets("/path/to/replies.jsonl", sc)

// Count them
tweets.count()

// Extract some fields
val r = tweets.map(tweet => (tweet.id, tweet.createdAt, tweet.username, tweet.text, tweet.fullText, tweet.lang,
                             tweet.isVerifiedUser, tweet.followerCount, tweet.friendCount))

// Take a sample of 10 on console
r.take(10)

// Exiting paste mode, now interpreting.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.TweetUtils._
tweets: org.apache.spark.rdd.RDD[org.json4s.JValue] = MapPartitionsRDD[28] at filter at package.scala:65
r: org.apache.spark.rdd.RDD[(String, String, String, String, String, String, Boolean, Int, Int)] = MapPartitionsRDD[29] at map at <console>:50
res1: Array[(String, String, String, String, String, String, Boolean, Int, Int)] = Array((1004742470700322816,Thu Jun 07 15:10:46 +0000 2018,realDonaldTrump,"",When will people start saying, “thank you, Mr. President, for firing James Comey?”,en,true,52543149,46), (1005275688990052352,Sat Jun 09 02:29:36 +0000 2018,love4All_7,"",@realDonaldTrump Pleases look into my brothers case. Miskin Kamara, he's been locked up for 13+ years w...

If you want to use the same tweet set, it's here.

Additional Notes:

Once this is good to go, I'll take care of #194.

Interested parties

@ianmilligan1 @lintool

@ianmilligan1 (Member) left a comment

Tested and works very well. Can merge once the lights turn green.

@codecov (bot) commented Aug 10, 2018

Codecov Report

Merging #252 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #252      +/-   ##
==========================================
+ Coverage   70.11%   70.14%   +0.02%     
==========================================
  Files          41       41              
  Lines        1024     1025       +1     
  Branches      191      191              
==========================================
+ Hits          718      719       +1     
  Misses        240      240              
  Partials       66       66

Impacted Files | Coverage Δ
...n/scala/io/archivesunleashed/util/TweetUtils.scala | 92.3% <100%> (+0.64%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a5fa151...a4aba17.

@ianmilligan1 ianmilligan1 merged commit 62628b4 into archivesunleashed:master Aug 10, 2018
@ruebot ruebot deleted the issue-192 branch August 10, 2018 14:25