diff --git a/docs/images/CMD.png b/docs/images/CMD.png
new file mode 100644
index 00000000..8ed68e14
Binary files /dev/null and b/docs/images/CMD.png differ
diff --git a/docs/images/echidna_csv.png b/docs/images/echidna_csv.png
new file mode 100644
index 00000000..757fed86
Binary files /dev/null and b/docs/images/echidna_csv.png differ
diff --git a/docs/images/excel_echidna.png b/docs/images/excel_echidna.png
new file mode 100644
index 00000000..16779207
Binary files /dev/null and b/docs/images/excel_echidna.png differ
diff --git a/docs/images/json_echidna.png b/docs/images/json_echidna.png
new file mode 100644
index 00000000..ee7e2ee2
Binary files /dev/null and b/docs/images/json_echidna.png differ
diff --git a/docs/images/pip_install.png b/docs/images/pip_install.png
new file mode 100644
index 00000000..adf3310f
Binary files /dev/null and b/docs/images/pip_install.png differ
diff --git a/docs/images/twarc_configure.png b/docs/images/twarc_configure.png
new file mode 100644
index 00000000..2648e010
Binary files /dev/null and b/docs/images/twarc_configure.png differ
diff --git a/docs/images/twarc_count_echidna.png b/docs/images/twarc_count_echidna.png
new file mode 100644
index 00000000..69886ac3
Binary files /dev/null and b/docs/images/twarc_count_echidna.png differ
diff --git a/docs/images/twarc_help.png b/docs/images/twarc_help.png
new file mode 100644
index 00000000..9d1816e9
Binary files /dev/null and b/docs/images/twarc_help.png differ
diff --git a/docs/images/twarc_progress_download.png b/docs/images/twarc_progress_download.png
new file mode 100644
index 00000000..399ed20a
Binary files /dev/null and b/docs/images/twarc_progress_download.png differ
diff --git a/docs/images/win_installer.png b/docs/images/win_installer.png
new file mode 100644
index 00000000..03bf2d7b
Binary files /dev/null and b/docs/images/win_installer.png differ
diff --git a/docs/tutorial.md b/docs/tutorial.md
index 6ad61f93..a377a4b6 100644
--- a/docs/tutorial.md
+++ b/docs/tutorial.md
@@ -5,7 +5,7 @@ Twarc is a command line tool for collecting Twitter data via Twitter's web Appli
 By the end of this tutorial, you will have:
 
 1. Familiarised yourself with interacting with a command line application via a terminal
-2. Setup twarc so you can collect data from the Twitter API (version 2)
+2. Set up Twarc so you can collect data from the Twitter API (version 2)
 3. Constructed two Twitter search queries to address a specific research question
 4. Collected data for those two queries
 5. Processed the collected data into formats suitable for other analysis
@@ -17,36 +17,27 @@
 
 This tutorial is built around collecting data from Twitter to address the following research question:
 
- Which monotreme is currently the coolest - the echidna or the platypus?
-
-We'll answer this question with a simple quantitative approach to analysing the collected data, by counting the volume of likes that tweets mentioning each species of animal accrue. For the purposes of this tutorial, the species that gets the most likes on tweets is going to be considered the "coolest". Of course this is a very simplistic quantitative approach so that this tutorial can get you started on collecting and analysing data. To seriously study the relative coolness of monotremes there are a wide variety of more appropriate (but also more involved) methods.
+***Which monotreme is currently the coolest - the echidna or the platypus?***
+We'll answer this question with a simple quantitative approach to analysing the collected data: counting the volume of likes that tweets mentioning each species of animal accrue. For this tutorial, the species that gets the most likes on tweets is going to be considered the "coolest". This is a very simplistic quantitative approach, just to get you started on collecting and analysing Twitter data. To seriously study the relative coolness of monotremes, there are a wide variety of more appropriate (but also more involved) methods.

## Introduction to twarc and the Twitter API

### What is an API?

-An Application Programming Interface (API) is a common method for software applications and services
-to allow other systems or people to programmatically interact with their system. For example,
-Twitter has an API which allows external systems to make requests to Twitter for information or
-actions. Twitter (and many other web apps and services) use an HTTP REST API, meaning that to interact
-with Twitter through the API you can send an HTTP request to a specific URL (also known as an endpoint) provided by Twitter, and
-Twitter will respond with a bundle of information in JSON format for you.
+An **Application Programming Interface** (API) is a common method for software applications and services to allow other systems or people to programmatically interact with them. For example, Twitter has an API which allows external systems to make requests to Twitter for information or actions. Twitter (and many other web apps and services) uses an HTTP REST API, meaning that to interact with Twitter through the API you can send an HTTP request to a specific URL (also known as an **endpoint**) provided by Twitter, and Twitter will respond with a bundle of information in JSON format for you.

-Twarc acts as a tool or an intermediary for you to use so that you don't have to manage the details
-of how exactly to make requests to the Twitter API and handle Twitter's responses. Twarc commands
-correspond roughly with Twitter API endpoints. For example, when you use Twarc to fetch the timeline of a specific
-twitter account (we'll use @Twitter in this example), this is the sequence of events:
+Twarc acts as a tool or an intermediary for you to interact with the Twitter API, so that you don't have to manage the details of how exactly to make requests to the Twitter API and handle Twitter's responses. Twarc commands correspond roughly with Twitter API endpoints. For example, when you use Twarc to fetch the timeline of a specific Twitter account (we'll use @Twitter in this example), this is the sequence of events:

 1. You run `twarc2 timeline Twitter tweets.jsonl`
-2. twarc2 makes a request on your behalf to the [Twitter v2 user lookup API endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/introduction)
-   in order to find the user ID for the @Twitter account, and receives a response from the Twitter API server with that user ID
-3. twarc2 makes a request on your behalf to the [Twitter v2 timeline API endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/introduction),
-   using the user ID determined in step 2, and receives a response (or several responses) from the Twitter API server with @Twitter's tweets
+
+2. 
twarc2 makes a request on your behalf to the [Twitter v2 user lookup API endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/introduction) in order to find the user ID for the @Twitter account, and receives a response from the Twitter API server with that user ID + +3. twarc2 makes a request on your behalf to the [Twitter v2 timeline API endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/introduction), using the user ID determined in step 2, and receives a response (or several responses) from the Twitter API server with @Twitter's tweets + 4. twarc2 consolidates the timeline responses from step 3 and outputs them according to your initial command, in this case as `tweets.jsonl` -There are a great many resources on the internet to learn more about APIs more generally and how to use them in a -variety of contexts. Here are a few introductory articles: +There are a great many resources on the internet to learn more about APIs more generally and how to use them in a variety of contexts. Here are a few introductory articles: - [How to Geek: What is an API, and how do developers use them?](https://www.howtogeek.com/343877/what-is-an-api/) - [IBM: What is an API?](https://www.ibm.com/cloud/learn/api) @@ -58,7 +49,7 @@ More detailed information on APIs and working with them: ### What can you do with the Twitter API? -The Twitter API is very popular in academic communities for good reason - it is one of the most accessible and research-friendly of the popular social media platforms at present. The Twitter API is well-established and offers a broad range of possibilities for data collection. +The Twitter API is very popular in academic communities for good reason: it is one of the most accessible and research-friendly of the popular social media platforms at present. The Twitter API is well-established and offers a broad range of possibilities for data collection. Here are some examples of things you can do with the Twitter API: @@ -70,15 +61,14 @@ Here are some examples of things you can do with the Twitter API: - Map Twitter account followers and followees within or around a group of users - Trace conversations and interactions around users or tweets of interest -You may notice as you read about the Twitter API that there are two versions of the Twitter API - version 1.1 and version 2. At the time of writing, -Twitter is providing both versions of the API, but at some unknown point in the future version 1.1 may be discontinued. Twarc can handle either API version: the `twarc` command uses version 1.1 of the Twitter API, the `twarc2` command uses version 2. Take care when reading documentation and tutorials as to which Twitter API version is being referenced. This tutorial uses version 2 of the Twitter API. +You may notice as you read about the Twitter API that there are two versions of the Twitter API - version 1.1 and version 2. At the time of writing, Twitter is providing both versions of the API, but at some unknown point in the future version 1.1 may be discontinued. Twarc can handle either API version: the `twarc` command uses version 1.1 of the Twitter API, the `twarc2` command uses version 2. Take care when reading documentation and tutorials as to which Twitter API version is being referenced. **This tutorial uses version 2 of the Twitter API**. Twitter API endpoints can be structured either around tweets or around user accounts. 
For example, the search endpoint provides lists of tweets - user information is included, but the data is focused on the tweets. The available endpoints and their details are evolving as Twitter develops and releases its API v2; for the most up-to-date information, refer to [the Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api).

Some of the most used endpoints for research purposes are:

- [search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction): This is the endpoint used to search tweets, whether recent or historical.
-- [lookup](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction): The lookup endpoints are useful when you have IDs of tweets of interest and want to fetch further data about those tweets - known in the Twarc community as *hydrating* the tweets.
+- [lookup](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction): The lookup endpoints are useful when you have IDs of tweets of interest and want to fetch further data about those tweets - known in the Twarc community as **hydrating** the tweets.
- [follows](https://developer.twitter.com/en/docs/twitter-api/users/follows/introduction): The follows endpoint allows collecting information about who follows whom on Twitter.

With the Twitter API, you can get data related to all types of objects that make up the Twitter experience, including [tweets](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet) and [users](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user). The Twitter documentation provides full details, and these two pages are very useful to bookmark!

@@ -91,13 +81,16 @@ The Twitter documentation also provides some useful tools for constructing searc

The rest of this tutorial is going to focus on using the Twitter search API to retrieve tweets containing content relevant to the research question. We've chosen to focus on this because:

1. With the rich functionality available in the search API, the data collection for many projects can be condensed down to a few carefully chosen searches.
-2. With the academic access track it's possible to search the entire Twitter archive, making search uniquely powerful among the endpoints Twitter supports.
+2. With the academic access track it's possible to search the entire Twitter archive, making search uniquely powerful among the endpoints Twitter supports.

### Introduction to twarc

-Twarc is at its core an application for interacting with the Twitter API, reading results from the different functionality the API offers, and safely writing the collected data to your machine for further analysis. Twarc handles the mechanical details of interacting with the Twitter API like including information to authenticate yourself, making HTTP requests to the API, formatting data in the right way, and retrying when things on the internet fail. Your job is to work out 1.) Which endpoint you want to call on from the Twitter API 2.) Which data you want to retrieve from that endpoint.
+Twarc is at its core an application for interacting with the Twitter API, reading results from the different functionality the API offers, and safely writing the collected data to your machine for further analysis. Twarc handles the mechanical details of interacting with the Twitter API, like including information to authenticate yourself, making HTTP requests to the API, formatting data in the right way, and retrying when things on the internet fail. 
Your job is to work out

+1. Which Twitter API endpoint you want to call
+2. Which data you want to retrieve from that endpoint.

-Twarc is a command line based application - to use twarc you type a command specifying a particular action, and the results of that command are shown as text on screen. If you haven't used a command line interface before, don't worry! Although there is a bit of a learning curve at the beginning, you will quickly get the hang of it - and because everything is a typed command, it is very easy to record and share _exactly_ how you collected data with other people.
+Twarc is a command line based application - to use twarc you type a command specifying a particular action, and the results of that command are shown as text on screen. If you haven't used a command line interface before, don't worry! Although there is a bit of a learning curve at the beginning, you will quickly get the hang of it - and because everything is a typed command, it is very easy to record and share _exactly_ how you collected data with other people.

## Considerations when using social media data for research

@@ -138,15 +131,15 @@ Twarc is a command line application, written in the Python programming language.

[Start here](https://developer.twitter.com/en/apply-for-access) to apply for a Twitter developer account and follow the steps in [our developer access guide](twitter-developer-access.md). For this tutorial, you can skip step 2, as we won't require academic access.

-Once you have the `bearer_token` you are ready for the next step. This token is like a password, so you shouldn't share it with other people. You will also need to be able to enter this token once to configure Twarc, so it would be best to copy and paste it to a text file on your local machine until we've finished configuration.
+Once you have the **Bearer Token**, you are ready for the next step. This token is like a password, so you shouldn't share it with other people. You will need to enter this token once to configure Twarc, so it is best to copy and paste it into a text file on your local machine until we've finished configuration.

### Install Python

#### Windows

-Install the latest version [for Windows](https://www.python.org/downloads/windows/). You will need to set the path, as shown in the screenshot below.
+Install the latest version [for Windows](https://www.python.org/downloads/windows/). During the installation, make sure the *Add Python to PATH* option is selected.

-![Screenshot showing the path selection settings on window]()
+![Screenshot of the Windows installer with the Add Python to PATH option selected](images/win_installer.png)

#### Mac

Install the latest version [for Mac](https://www.python.org/downloads/macos/). N

@@ -157,28 +150,29 @@ ### Install twarc and other packages needed for the tutorial

For this tutorial we're going to install three Python packages: `twarc`, an extension called `twarc-csv`, and `pandas`, a Python library for data analysis. We will use a command line interface to install these packages. On Windows we will use the `cmd` console, which can be found by searching for `cmd` from the start menu - you should see a prompt like the below screenshot. On Mac you can open the `Terminal` app. 
-![Screenshot showing the opening of the cmd window on windows]()
-![Screenshot showing the open cmd prompt]()
+![Screenshot showing the opening of the cmd window on Windows](images/CMD.png)

Once you have a terminal open we can run the following command to install the necessary packages:

-> pip install twarc twarc-csv pandas
+```shell
+pip install twarc twarc-csv pandas
+```

You should see output similar to the following:

-![Screenshot showing the output of installing twarc and twarc-csv]()
+![Screenshot showing the output of installing twarc and twarc-csv](images/pip_install.png)

-### Our first command/making sure everything is working
+### Our first command: making sure everything is working

Let's open a terminal and get started - just like when installing twarc, you will want to use the `cmd` application on Windows and the `Terminal` application on Mac.

The first command we want to run is to check if everything in twarc is installed and working correctly. We'll use twarc's built-in `help` for this. Running the following command should show you a brief overview of the functionality that the twarc2 command provides and some of the options available:

-```
+```shell
twarc2 --help
```

-![Screenshot showing the default help output on windows]()
+![Screenshot showing the default help output on Windows](images/twarc_help.png)

Twarc is structured like many other command line applications: there is a single main command, `twarc2`, to launch the application, and then you provide a subcommand, or additional arguments, or flags to provide additional context about what that command should actually do. In this case we're only launching the `twarc2` command, and providing a single _flag_ `--help` (the double-dash syntax is usually used for this). Most terminal applications will have a `--help` or `-h` flag that will provide some useful information about the application you're running. This often includes example usage, options, and a short description.

@@ -197,7 +191,7 @@ twarc2 configure

On running this command twarc will prompt us to paste our bearer token, as shown in the screenshot below. After entering our token, we will be prompted to enter additional information - this is not necessary for this tutorial, so we will skip this step by typing the letter `n` and hitting `enter`.

-![Redacted output of what twarc configure looks like]()
+![Redacted output of what twarc configure looks like](images/twarc_configure.png)

## Introduction to Twitter search and counts

@@ -211,68 +205,123 @@ There are two key commands that the Twitter API provides for search: a `search`

- the count and trend over time is useful in and of itself
- if you accidentally search for the wrong thing you can consume your monthly quota of tweets without collecting anything useful

-Let's get started with the `counts` API - in twarc this is accessible by the command `counts`. As before `twarc2` is our entry command, `counts` is the subcommand we're interested in, and the `echidna` is what we're interested in searching for on Twitter (the query).
+Let's get started with the `counts` API - in twarc this is accessible by the command `counts`. As before, `twarc2` is our entry command, `counts` is the subcommand we're interested in, and `echidna` is what we're interested in searching for on Twitter (the query).

-`twarc2 counts echidna`
+```shell
+twarc2 counts echidna
+```

You should see something like the below screenshot - and yes, this output isn't very readable! By default twarc shows us the response in the JSON format directly from the Twitter API, so it's not great for using directly on the command line. 
-![screenshot of the first command run in a query]()
+![Screenshot of the first command run in a query](images/twarc_count_echidna.png)

Let's improve this by updating our command to:

-`twarc2 counts echidna --text --granularity day`
+```shell
+twarc2 counts echidna --text --granularity day
+```

And we should see output like below. Note that `--text` and `--granularity` are optional flags provided to the `twarc2 counts` command; we can see other options by running `twarc2 counts --help`. In this case `--text` returns a simplified text output for easier reading, and `--granularity day` is passed to the Twitter API to specify that we're interested only in daily counts of tweets, not the default hourly count.

-(table of results)
+```console
+2022-11-03T02:49:02.000Z - 2022-11-04T00:00:00.000Z: 974
+2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 802
+2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 527
+2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 554
+2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 883
+2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 723
+2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,567
+2022-11-10T00:00:00.000Z - 2022-11-10T02:49:02.000Z: 219
+```

-Note that this is only the count for the last seven days - this is the level of search functionality available for all developers via the standard track of the Twitter API. If you have access to the [Twitter Academic track](https://developer.twitter.com/en/use-cases/do-research/academic-research), you can switch to searching the full Twitter archive from the `counts` and `search` commands by adding the `--archive` flag.
+Note that this is only the count for the last seven days, which is the level of search functionality available for all developers via the standard track of the Twitter API. If you have access to the [Twitter Academic track](https://developer.twitter.com/en/use-cases/do-research/academic-research), you can switch to searching the full Twitter archive from the `counts` and `search` commands by adding the `--archive` flag.

-Twitter search is powerful and provides many rich options. However, it also functions a little differently to most other search engines, because Twitter search does not focus on _ranking_ tweets by relevance (like a web search engine does). Instead, Twitter search via the API focuses on retrieving all matching tweets in chronological order. In other words, Twitter search uses the [Boolean model of searching](https://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html), and returns the documents that match exactly what you provide and nothing else.
+Twitter search is powerful and provides many rich options. However, it also functions a little differently to most other search engines, because Twitter search does not focus on _ranking_ tweets by relevance (like a web search engine does). Instead, Twitter search via the API focuses on retrieving all matching tweets in chronological order. In other words, Twitter search uses the [Boolean model of searching](https://nlp.stanford.edu/IR-book/html/htmledition/boolean-retrieval-1.html), and returns the documents that match exactly what you provide and nothing else.

Let's work through this example a little further. First, we want to expand our query to capture more variants of the word echidna - note that Twitter search via the API matches on the whole word, so `echidna` and `echidnas` are different. 
You can also see that we've added some double quotes around our query - without these quotes the individual pieces of our query might be interpreted as additional arguments to our search command:

-`twarc2 counts "echidna echidna's echidnas" --granularity day --text`
+```shell
+twarc2 counts "echidna echidna's echidnas" --granularity day --text
+```

-(table of results)
+```console
+2022-11-03T03:40:44.000Z - 2022-11-04T00:00:00.000Z: 0
+2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 0
+2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 0
+2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 0
+2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 0
+2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 0
+2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 0
+2022-11-10T00:00:00.000Z - 2022-11-10T03:40:44.000Z: 0
+```

Suddenly we're retrieving no results at all! By default, if you don't specify an operator, the Twitter API assumes you mean AND, or that all of the words should be present - we will need to explicitly say that we want any of these words using the OR operator:

-`twarc2 counts "echidna OR echidna's OR echidnas" --granularity day --text`
+```shell
+twarc2 counts "echidna OR echidna's OR echidnas" --granularity day --text
+```

-(table of results)
+```console
+2022-11-03T03:42:10.000Z - 2022-11-04T00:00:00.000Z: 964
+2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 846
+2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 552
+2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 573
+2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 962
+2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 758
+2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,591
+2022-11-10T00:00:00.000Z - 2022-11-10T03:42:10.000Z: 288
+```

We can also apply operators based on other content or properties of tweets (see more [search operators](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list) in the Twitter API documentation). Because we've decided to focus on the number of likes on tweets as our measure of coolness, we want to exclude retweets. If we don't exclude retweets, our like measure might be heavily influenced by one highly retweeted tweet. We can do this using the `-` (minus) operator, which allows us to exclude tweets matching a criterion, in conjunction with the `is:retweet` operator, which filters on whether the tweet is a retweet or not. If we applied just the `is:retweet` operator we'd only see the retweets, the opposite of what we want.

-`twarc2 counts "echidna OR echidna's OR echidnas -is:retweet" --granularity day --text`
+```shell
+twarc2 counts "echidna OR echidna's OR echidnas -is:retweet" --granularity day --text
+```

-(table of results)
+```console
+2022-11-03T03:43:02.000Z - 2022-11-04T00:00:00.000Z: 957
+2022-11-04T00:00:00.000Z - 2022-11-05T00:00:00.000Z: 826
+2022-11-05T00:00:00.000Z - 2022-11-06T00:00:00.000Z: 546
+2022-11-06T00:00:00.000Z - 2022-11-07T00:00:00.000Z: 570
+2022-11-07T00:00:00.000Z - 2022-11-08T00:00:00.000Z: 931
+2022-11-08T00:00:00.000Z - 2022-11-09T00:00:00.000Z: 750
+2022-11-09T00:00:00.000Z - 2022-11-10T00:00:00.000Z: 1,587
+2022-11-10T00:00:00.000Z - 2022-11-10T03:43:02.000Z: 288
+```

There's one tiny gotcha from the Twitter API here, which is important to know about. AND operators are applied before OR operators, even if the AND is not specified by the user. The query we wrote above actually means something like below. 
We're only removing the retweets containing the word "echidnas", not all retweets:

-`echidna OR echidna's OR (echidnas AND -is:retweet)`
+```
+echidna OR echidna's OR (echidnas AND -is:retweet)
+```

We can make our intent explicit by adding parentheses to group terms. This is a good idea in general to make your meaning clear, even if you know all of the operator rules.

-`twarc2 counts "(echidna OR echidna's OR echidnas) -is:retweet" --granularity day --text`
+```shell
+twarc2 counts "(echidna OR echidna's OR echidnas) -is:retweet" --granularity day --text
+```

Now for the purposes of this tutorial we're going to stop exploring here, but we could continue to refine and improve this query to match our research question. Twitter lets you build very long queries (up to 512 characters on the standard track and 1024 for the academic track) so you have plenty of scope to express yourself.

If we apply the same kind of process to the platypus case, we might end up with something like the following. In this case it was necessary to use the [Twitter search web interface](https://twitter.com/explore) to find some of the variations in the word platypus:

-`twarc2 counts "(platypus OR platpus's OR platypi OR platypusses OR platypuses) -is:retweet" --granularity day --text`
+```shell
+twarc2 counts "(platypus OR platypus's OR platypi OR platypusses OR platypuses) -is:retweet" --granularity day --text
+```

Having decided on the actual queries to run and examined the counts, now it's time to actually collect the tweets! We can take the queries we ran earlier, replace the `counts` command with `search`, and remove the `counts`-specific arguments to get:

-`twarc2 search "(echidna OR echidna's OR echidnas) -is:retweet" echidna.json`
-`twarc2 search "(platypus OR platpus's OR platypi OR platypusses OR platypuses) -is:retweet" platypus.json`
+```shell
+twarc2 search "(echidna OR echidna's OR echidnas) -is:retweet" echidna.json
+
+twarc2 search "(platypus OR platypus's OR platypi OR platypusses OR platypuses) -is:retweet" platypus.json
+```

Running these two commands will save the tweets matching each of those searches to two files on our disk, which we will use for the next sections. While the tweets are being saved, twarc2 shows the progress of the collection:

-![Screenshot showing the progress of the tweets being downloaded]()
+![Screenshot showing the progress of the tweets being downloaded](images/twarc_progress_download.png)

TIP: if you're not sure where the files above have been saved, you can run the command `cd` on Windows, or `pwd` on Mac, to have your shell print out the folder in the filesystem where twarc has been working.

## Looking at the data

Now that we've collected some data, it's time to take a look at it. Let's start by viewing the collected data in its plainest form: as a text file. Although we named the file with an extension of `.json`, this is just a convention: the actual file content is plain text in the [JSON](https://en.wikipedia.org/wiki/JSON) format. Let's open this file with our inbuilt text editor (Notepad on Windows, ... on Mac).

-![Screenshot of the json file in notepad]()
+![Screenshot of the json file in Notepad](images/json_echidna.png)

You'll notice immediately that there is a *lot* of data in that file: tweets are rich objects, and we mentioned that twarc by default captures as much information as Twitter makes available. 
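If you'd like a quick look at the structure without scrolling through the whole file, a few lines of Python can pretty-print the start of the first collected page. This is an optional sketch - the filename `peek.py` and the 2,000-character cut-off are just illustrative choices - and it assumes you run it from the same folder as the `echidna.json` file collected above:

```python
# peek.py - optional sketch: pretty-print the start of the first response page
import json

with open("echidna.json") as infile:
    # twarc2 writes JSONL: each line of the file is one JSON response page
    page = json.loads(infile.readline())

# indent=2 exposes the nested structure; slice the string to show only the start
print(json.dumps(page, indent=2)[:2000])
```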
Further, the Twitter API provides data in a format that makes it convenient for machines to work with, but not so much for humans.

## Making a CSV file from our collected tweets

-We don't recommend trying to manually parse this raw data unless you have specific needs that aren't covered by existing tools. So we're going to use the `twarc-csv` package that we installed earlier to do the heavy lifting of transforming the collected JSON into a more friendly comma-separated value ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)) file. CSV is a simple plaintext format, but unlike JSON format is easy to import or open with a spreadsheet.
+We don't recommend trying to manually parse this raw data unless you have specific needs that aren't covered by existing tools. So we're going to use the `twarc-csv` package that we installed earlier to do the heavy lifting of transforming the collected JSON into a more friendly comma-separated value ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)) file. CSV is a simple plaintext format, but unlike JSON it is easy to import into or open with a spreadsheet.

The `twarc-csv` package lets us use a `csv` command to transform the files from twarc:

-```
+```shell
 twarc2 csv echidna.json echidna.csv
+
 twarc2 csv platypus.json platypus.csv
 ```

-If we look at these files in our text editor again, we'll see a nice structure of one line per tweet, with all of the many columns for that tweet.
+If we look at these files in our text editor again, we'll see a nice structure of one line per tweet, with all of the many columns for that tweet.

-![Screenshot of the plaintext CSV file in notepad]()
+![Screenshot of the plaintext CSV file in Notepad](images/echidna_csv.png)

-Since we're going to do more analysis with the Pandas library to answer our question, we will want to create the CSV with only the columns of interest. This will reduce the time and amount of RAM you need to load your dataset. For example, the following commands produce CSV files with a small number of fields:
+Since we're going to do more analysis with the Pandas library to answer our question, we will want to create the CSV with only the columns of interest. This will reduce the time and amount of RAM you need to load your dataset. For example, the following commands produce CSV files with a small number of fields:

-```
+```shell
 twarc2 csv --output-columns id,created_at,author_id,text,referenced_tweets.retweeted.id,public_metrics.like_count echidna.json echidna_minimal.csv
+
 twarc2 csv --output-columns id,created_at,author_id,text,referenced_tweets.retweeted.id,public_metrics.like_count platypus.json platypus_minimal.csv
 ```

@@ -315,7 +366,7 @@ It's tempting to try to open these CSV files directly in Excel, but if you do yo

 3. Tweets may be broken up on newlines.
 4. Excel can only support 1,048,576 rows - it's very easy to collect tweet datasets bigger than this.

-![Screenshot of the broken CSV file opened directly in excel]()
+![Screenshot of the broken CSV file opened directly in Excel](images/excel_echidna.png)

If you save a file from Excel with any of those problems, that file is no longer useful for most purposes (this is a common and longstanding problem with spreadsheet software that affects many fields - for example, in genomics: https://www.nature.com/articles/d41586-021-02211-4). While it is possible to make Excel do the right thing with your data, it takes more work, and a single mistake can lead to loss of important data. 
Therefore our recommendation is, if possible, to avoid the use of spreadsheets for analysing Twitter data.

@@ -324,6 +375,8 @@
 If you are going to be using the scientific Python library [Pandas](https://pandas.pydata.org/) for any processing or analysis, you may wish to use Pandas methods to load your data. Pandas can load and manipulate tabular data like we have in our CSV files. Note that for this section we're going to run a very simple computation; the references will have links to more extensive resources for learning more.

 ```python
+# process_monotremes.py
+
 import pandas
 
 echidna = pandas.read_csv("echidna_minimal.csv")
@@ -337,26 +390,31 @@ print(f"Total likes on echidna tweets: {echidna_likes}. Total likes on platypus

Run this script through Python to see which of the monotremes is the coolest:

-`python process_monotremes.py`
+```shell
+python process_monotremes.py
+```

### Answering the research question: which monotreme is the coolest?

At the time of creating this tutorial, the above script, run with the freshly collected data, leads to the following result:

-`Total likes on echidna tweets: 1787652. Total likes on platypus tweets: 3462715.`
+```console
+Total likes on echidna tweets: 1787652. Total likes on platypus tweets: 3462715.
+```

-On that basis, we can conclude that at the time of running this search the platypus is nearly twice as cool as measured by Twitter likes.
+On that basis, we can conclude that at the time of running this search the platypus is nearly twice as cool as the echidna, as measured by Twitter likes.

Of course this is a simplistic approach to answering this specific research question - we could have made many other choices. Even within a simple quantitative approach based on metrics, we could have chosen to look at other engagement counts like the number of retweets, or at the number of followers of the accounts tweeting about each animal (because a "cooler" account will have more followers). Much of the challenge in using Twitter for research lies in both asking the right research question and choosing the right approach to the data to address it.

-## Prepare a dataset for sharing / using a shared dataset
+## Prepare a dataset for sharing/using a shared dataset

Having performed this analysis and come to a conclusion, it is good practice to share the underlying data so other people can reproduce these results (with some caveats). Noting that we want to preserve Twitter users' agency over the availability of their content, and to comply with Twitter's Developer Agreement, we can do this by creating a dataset of tweet IDs. Instead of sharing the content of the tweets, we can share the unique ID of each tweet, which allows others to `hydrate` the tweets by retrieving them again from the Twitter API. 
This can be done as follows using twarc's `dehydrate` command: -``` +```shell twarc2 dehydrate --id-type tweets platypus.json platypus_ids.txt + twarc2 dehydrate --id-type tweets echidna.json echidna_ids.txt ``` @@ -364,8 +422,9 @@ These commands will produce the two text files, with each line in these files co To `hydrate`, or retrieve the tweets again, we can use the corresponding commands: -``` +```shell twarc2 hydrate platypus_ids.txt platypus_hydrated.json + twarc2 hydrate echidna_ids.txt echidna_hydrated.json ``` diff --git a/setup.py b/setup.py index 1794c1d3..91d32459 100644 --- a/setup.py +++ b/setup.py @@ -25,7 +25,7 @@ classifiers=[ "License :: OSI Approved :: MIT License", ], - python_requires=">=3.3", + python_requires=">=3.6", install_requires=dependencies, setup_requires=["pytest-runner"], tests_require=[ diff --git a/test_twarc2.py b/test_twarc2.py index 429c0b5c..715fe86d 100644 --- a/test_twarc2.py +++ b/test_twarc2.py @@ -31,7 +31,7 @@ ) -def test_version(): +def atest_version(): import setup assert setup.version == version @@ -40,7 +40,7 @@ def test_version(): assert f"twarc/{version}" in user_agent -def test_auth_types_interaction(): +def atest_auth_types_interaction(): """ Test the various options for configuration work as expected. """ @@ -81,7 +81,7 @@ def test_auth_types_interaction(): tw.sample() -def test_sample(): +def atest_sample(): # event to tell the filter stream to close event = threading.Event() @@ -105,18 +105,18 @@ def test_search_recent(sort_order): found_tweets = 0 pages = 0 - for response_page in T.search_recent("#auspol", sort_order=sort_order): + for response_page in T.search_recent("politics", sort_order=sort_order): pages += 1 tweets = response_page["data"] found_tweets += len(tweets) - if pages == 3: + if pages == 2: break - assert 200 <= found_tweets <= 300 + assert 100 <= found_tweets <= 200 -def test_counts_recent(): +def atest_counts_recent(): found_counts = 0 @@ -132,7 +132,7 @@ def test_counts_recent(): os.environ.get("SKIP_ACADEMIC_PRODUCT_TRACK") != None, reason="No Academic Research Product Track access", ) -def test_counts_empty_page(): +def atest_counts_empty_page(): found_counts = 0 @@ -148,7 +148,7 @@ def test_counts_empty_page(): assert found_counts == 72 -def test_search_times(): +def atest_search_times(): found = False now = datetime.datetime.now(tz=pytz.timezone("Australia/Melbourne")) # twitter api doesn't resolve microseconds so strip them for comparison @@ -169,7 +169,7 @@ def test_search_times(): assert found -def test_user_ids_lookup(): +def atest_user_ids_lookup(): users_found = 0 users_not_found = 0 @@ -189,7 +189,7 @@ def test_user_ids_lookup(): assert users_found + users_not_found == 999 -def test_usernames_lookup(): +def atest_usernames_lookup(): users_found = 0 usernames = ["jack", "barackobama", "rihanna"] for response in T.user_lookup(usernames, usernames=True): @@ -198,7 +198,7 @@ def test_usernames_lookup(): assert users_found == 3 -def test_tweet_lookup(): +def atest_tweet_lookup(): tweets_found = 0 tweets_not_found = 0 @@ -227,7 +227,7 @@ def test_tweet_lookup(): os.environ.get("GITHUB_ACTIONS") != None, reason="stream() seems to throw a 400 error under GitHub Actions?!", ) -def test_stream(): +def atest_stream(): # remove any active stream rules rules = T.get_stream_rules() if "data" in rules and len(rules["data"]) > 0: @@ -280,7 +280,7 @@ def test_stream(): assert "data" not in rules -def test_timeline(): +def atest_timeline(): """ Test the user timeline endpoints. 
@@ -301,7 +301,7 @@ def test_timeline(): assert found >= 200 -def test_timeline_username(): +def atest_timeline_username(): """ Test the user timeline endpoints with username. @@ -322,12 +322,12 @@ def test_timeline_username(): assert found >= 200 -def test_missing_timeline(): +def atest_missing_timeline(): results = T.timeline(1033441111677788160) assert len(list(results)) == 0 -def test_follows(): +def atest_follows(): """ Test followers and and following. @@ -349,7 +349,7 @@ def test_follows(): assert found >= 1000 -def test_follows_username(): +def atest_follows_username(): """ Test followers and and following by username. @@ -371,7 +371,7 @@ def test_follows_username(): assert found >= 1000 -def test_flattened(): +def atest_flattened(): """ This test uses the search API to test response flattening. It will look at each tweet to find evidence that all the expansions have worked. Once it @@ -457,7 +457,7 @@ def test_flattened(): assert found_referenced_tweets, "found referenced tweets" -def test_ensure_flattened(): +def atest_ensure_flattened(): resp = next(T.search_recent("twitter", max_results=20)) # flatten a response @@ -510,7 +510,7 @@ def test_ensure_flattened(): twarc.expansions.ensure_flattened([[{"data": {"fake": "list_of_lists"}}]]) -def test_ensure_flattened_errors(): +def atest_ensure_flattened_errors(): """ Test that ensure_flattened doesn't return tweets for API responses that only contain errors. """ @@ -518,7 +518,7 @@ def test_ensure_flattened_errors(): assert twarc.expansions.ensure_flattened(data) == [] -def test_ensure_user_id(): +def atest_ensure_user_id(): """ Test _ensure_user_id's ability to discriminate correctly between IDs and screen names. @@ -538,7 +538,7 @@ def test_ensure_user_id(): assert T._ensure_user_id(1033441111677788160) == "1033441111677788160" -def test_liking_users(): +def atest_liking_users(): # This is one of @jack's tweets about the Twitter API likes = T.liking_users(1460417326130421765) @@ -554,7 +554,7 @@ def test_liking_users(): break -def test_retweeted_by(): +def atest_retweeted_by(): # This is one of @jack's tweets about the Twitter API retweet_users = T.retweeted_by(1460417326130421765) @@ -570,7 +570,7 @@ def test_retweeted_by(): break -def test_liked_tweets(): +def atest_liked_tweets(): # What has @jack liked? 
liked_tweets = T.liked_tweets(12) @@ -586,61 +586,61 @@ def test_liked_tweets(): break -def test_list_lookup(): +def atest_list_lookup(): parks_list = T.list_lookup(715919216927322112) assert "data" in parks_list assert parks_list["data"]["name"] == "National-parks" -def test_list_members(): +def atest_list_members(): response = list(T.list_members(715919216927322112)) assert len(response) == 1 members = twarc.expansions.flatten(response[0]) assert len(members) == 8 -def test_list_followers(): +def atest_list_followers(): response = list(T.list_followers(715919216927322112)) assert len(response) >= 2 followers = twarc.expansions.flatten(response[0]) assert len(followers) > 50 -def test_list_memberships(): +def atest_list_memberships(): response = list(T.list_memberships("64flavors")) assert len(response) == 1 lists = twarc.expansions.flatten(response[0]) assert len(lists) >= 9 -def test_followed_lists(): +def atest_followed_lists(): response = list(T.followed_lists("nasa")) assert len(response) == 1 lists = twarc.expansions.flatten(response[0]) assert len(lists) >= 1 -def test_owned_lists(): +def atest_owned_lists(): response = list(T.owned_lists("nasa")) assert len(response) >= 1 lists = twarc.expansions.flatten(response[0]) assert len(lists) >= 11 -def test_list_tweets(): +def atest_list_tweets(): response = next(T.list_tweets(715919216927322112)) assert "data" in response tweets = twarc.expansions.flatten(response) assert len(tweets) >= 90 -def test_user_lookup_non_existent(): +def atest_user_lookup_non_existent(): with pytest.raises(ValueError): # This user does not exist, and a value error should be raised T._ensure_user("noasdfasdf") -def test_twarc_metadata(): +def atest_twarc_metadata(): # With metadata (default) event = threading.Event() @@ -667,7 +667,7 @@ def test_twarc_metadata(): T.metadata = True -def test_docs_requirements(): +def atest_docs_requirements(): """ Make sure that the mkdocs requirements has everything that is in the twarc requirements so the readthedocs build doesn't fail. @@ -678,7 +678,7 @@ def test_docs_requirements(): assert twarc_reqs.issubset(mkdocs_reqs) -def test_geo(): +def atest_geo(): print(T.geo(query="Silver Spring"))