Add filters for fileSize and/or imageSize to atlas_media? #140

Closed
daxkellie opened this issue Mar 25, 2022 · 3 comments
Labels: enhancement (New feature or request)

Comments

@daxkellie
Contributor

It was suggested in a separate issue to add a fileSize filter for downloading media. This is potentially a good idea worth discussing, and I don't imagine it being too difficult to implement for atlas_media.

Not to dump more requests into the same Issue, but I think a fileSize filter or an imageWidth/Height filter would be fantastic for uses like mine.

@mjwestgate
Collaborator

This is an interesting one, as it may require a rethink of how atlas_media works. The current workflow for atlas_media is as follows:

  • Filters are first passed to atlas_occurrences, with a bespoke select function to add 'media' fields to the result. Each of these fields contains unique identifiers for media associated with an individual occurrence. If there is more than one file, those identifiers are separated by '|'.
  • Metadata for each ID is downloaded from the images API and joined to the occurrence download - this is the tibble returned by atlas_media containing relevant information such as file size and licence type. Unlike BioCache queries, there is no pre-filtering option with this API, making galah_filter an odd mechanism to interact with this dataset.
  • Loop over all image URLs and download to the cache
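
The first step above stores several media IDs in a single '|'-separated field. As a rough sketch of what unpacking that field involves, the base R snippet below splits a mock `images` column into one row per media ID. The data frame `occ` and the column name `media_id` are illustrative assumptions, and this is not galah's internal code.

```r
# Mock occurrence table (an assumption for illustration); the `images`
# column mimics the '|'-separated media IDs described above.
occ <- data.frame(
  scientificName = c("Litoria peronii", "Litoria peronii"),
  images = c(
    "a8c93ae3-7a53-4af5-9897-47c9c218f1f2 | c59ee1db-c120-4ad2-9027-b6feac5dee5c",
    "fe81d289-fc80-4f30-a6cd-b145200ba423"
  ),
  stringsAsFactors = FALSE
)

# Split on the literal pipe character, then trim surrounding whitespace.
id_list <- lapply(strsplit(occ$images, "|", fixed = TRUE), trimws)

# Build a long-form table: one row per media ID, with the occurrence
# columns repeated for each ID that belongs to that occurrence.
long <- data.frame(
  scientificName = rep(occ$scientificName, lengths(id_list)),
  media_id = unlist(id_list),
  stringsAsFactors = FALSE
)
```

Each row of `long` can then be matched against image metadata by `media_id`.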

atlas_media is unusual, therefore, in that it returns images as a side-effect; what is actually returned to the workspace is a tibble (as in atlas_occurrences and elsewhere), not the images themselves. Further, the current function design ensures that the resulting tibble can only be filtered once the images have already been downloaded, which is inefficient (and prevents filtering by licence type as per issue #151). A final problem is that if you have a set of media IDs, there is no way to bypass the atlas_occurrences step within atlas_media and just download the files you want.

Some useful additions would be:

  • allow media categories to be passed via galah_select, and thereby to atlas_occurrences, perhaps including a new group = media option
  • prevent atlas_media from downloading any images, and instead just return a tibble
  • add new function (e.g. get_media) that does the actual download
  • [lower priority] show_all_licences() could be created to hit https://images.ala.org.au/ws/LicenceMapping

A harder problem is how atlas_media should behave.

Option 1 is to force the user to do their own occurrence download first, and pass the results to atlas_media to get image metadata, i.e.

galah_call() |>
  galah_filter(year == 2022) |>
  galah_select(group = c("basic", "media")) |>
  atlas_occurrences() |>
  atlas_media() |>
  dplyr::filter(sizeInBytes < 10^6) |> # optional extra filtering stage
  get_media()

This is more modular than the current version and doesn't require many new function names, but it greatly changes the behaviour of atlas_media by forcing an intermediate call to galah_select and atlas_occurrences.

Option 2 is to keep the filtering behaviour of atlas_media unchanged, but create a new function for users who want a more modular workflow, e.g.

# basic usage
galah_call() |>
  galah_filter(year == 2022) |>
  atlas_media() |>            # returns a tibble only, but doesn't require atlas_occurrences first
  dplyr::filter(sizeInBytes < 10^6, licence == "http://creativecommons.org/licenses/by-sa/4.0/") |>
  get_media(cache = "my_cache", type = "thumbnail") # downloads images

# advanced usage
galah_call() |>
  galah_filter(year == 2022) |>
  galah_select(group = c("basic", "media")) |>
  atlas_occurrences() |>
  show_all_media() |>  # get metadata on images
  dplyr::filter(sizeInBytes < 10^6) |>
  get_media()

mjwestgate added a commit that referenced this issue Jul 21, 2022
…ds` (#140)

First step to modularising `atlas_media` as per issue #140
mjwestgate added a commit that referenced this issue Jul 21, 2022
Necessary lookup table for filtering by licence type, relating to issues #140 and #151
mjwestgate added a commit that referenced this issue Jul 21, 2022
`atlas_media` now returns a `tibble`, but does not download images; this allows use of e.g. `dplyr::filter` to reduce the number of images that will be returned (#140, #151)
- new function `collect_media` takes a tibble from `atlas_media` and downloads to the specified directory. Supports thumbnail downloads via `type` argument (#145)
- Alternatively, users can build their own media queries via `atlas_occurrences` and trigger a metadata download (equivalent to `atlas_media`) using `show_all_media`
@mjwestgate
Collaborator

Current status of a simple example:

x <- galah_call() |> 
  galah_filter(year == 2010) |> 
  galah_identify("Litoria peronii") |>  
  atlas_media() |> 
  collect_media(download_dir = "TEST")
28 files were downloaded to path/to/TEST

A more complex example showing how to filter by image size (here, by width):

galah_call() |> 
  galah_filter(year == 2010) |> 
  galah_identify("Litoria peronii") |>  
  atlas_media() |>  
  dplyr::filter(width > 1000) |> 
  collect_media(download_dir = "TEST", type = "thumbnail")

# A tibble: 16 × 13
   media_id                             mime_type  size_in_bytes date_uploaded       date_taken          height width creator        license       data_…¹ occur…² url   downl…³
   <chr>                                <chr>              <int> <chr>               <chr>                <int> <int> <chr>          <chr>         <chr>   <chr>   <chr> <chr>  
 1 218ef2ec-8bb7-4664-922c-de39031e0d86 image/jpeg        158380 2015-11-07 04:08:14 2015-11-07 04:08:14    830  1024 Robert Bender  ""            dr893   "e5aa8… http… /Users…
 2 f90e4dff-2ecf-4468-a6af-db9e61e2e300 image/jpeg        158380 2015-10-17 04:07:58 2015-10-17 04:07:58    830  1024 Robert Bender  ""            dr893   "d8804… http… /Users…
 3 00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0 image/jpeg        689628 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 4 090f0525-3c2c-422e-b8f3-7c13a3b9319d image/jpeg        756318 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 5 3f389785-7ae0-4de6-b324-b99a56910fc9 image/jpeg        910591 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 6 734f4c29-1f2c-4cee-adb2-6f3468c00faf image/jpeg        700821 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
# … with 10 more rows, and abbreviated variable names ¹​data_resource_uid, ²​occurrence_id, ³​download_path

Finally, how to avoid atlas_media completely, first running a custom atlas_occurrences call:

df <- galah_call() |> 
  galah_filter(year == 2010, images != "") |> 
  galah_identify("Litoria peronii") |>  
  galah_select(scientificName, eventDate, images) |>
  atlas_occurrences() 

> df
# A tibble: 19 × 3
   scientificName  eventDate            images                                                                                                                                  
   <chr>           <chr>                <chr>                                                                                                                                   
 1 Litoria peronii 2010-01-12T17:07:50Z fe81d289-fc80-4f30-a6cd-b145200ba423                                                                                                    
 2 Litoria peronii 2010-11-06T22:55:34Z c6c5ed86-d83c-4e0e-9fea-187700d0a328                                                                                                    
 3 Litoria peronii 2010-01-21T13:00:00Z df220f9c-ba53-4ee2-adc6-d5bc2919c317                                                                                                    
 4 Litoria peronii 2010-01-03T10:55:28Z cbc1873c-56b4-4ff1-83bc-226164d1d079                                                                                                    
 5 Litoria peronii 2010-11-07T09:55:34Z ee4b84e1-95bb-4ff1-bfaf-ec251754f0b5                                                                                                    
 6 Litoria peronii 2010-12-21T13:00:00Z 218ef2ec-8bb7-4664-922c-de39031e0d86                                                                                                    
 7 Litoria peronii 2010-12-09T13:00:00Z 44e386a3-842e-4191-a703-0d2df8942000                                                                                                    
 8 Litoria peronii 2010-12-21T13:00:00Z f90e4dff-2ecf-4468-a6af-db9e61e2e300                                                                                                    
 9 Litoria peronii 2010-11-07T09:53:16Z 9b813545-cecc-424b-83ba-89fdd1ebdf02                                                                                                    
10 Litoria peronii 2010-12-09T13:00:00Z a8c93ae3-7a53-4af5-9897-47c9c218f1f2 | c59ee1db-c120-4ad2-9027-b6feac5dee5c                                                             
11 Litoria peronii 2010-01-02T23:55:28Z 3f99a0ee-c337-4ede-8498-25a0fa102b92                                                                                                    
12 Litoria peronii 2010-12-25T06:10:00Z 00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0 | 090f0525-3c2c-422e-b8f3-7c13a3b9319d | 3f389785-7ae0-4de6-b324-b99a56910fc9 | 734f4c29-1f2c-4cee…
13 Litoria peronii 2010-02-23T08:46:49Z 09df8085-2a94-43c7-a9f1-6acfb0d03186                                                                                                    
14 Litoria peronii 2010-11-06T22:53:16Z 332a5582-ec9e-4471-b00d-e31e35bae3c6                                                                                                    
15 Litoria peronii 2010-02-01T01:17:00Z 6873f70f-bba1-4bc3-a3a3-528426f5e319                                                                                                    
16 Litoria peronii 2010-02-23T08:46:49Z 003f0cd0-70bd-4b28-98ef-955eba0470de | 2a6c5fc7-c3ac-4832-a572-cd248357ac47 | 61015de5-662d-437c-b603-51247ed1c063                      
17 Litoria peronii 2010-02-23T08:46:04Z ee82cc94-a1f8-48a1-8457-c941d55f376f                                                                                                    
18 Litoria peronii 2010-02-23T08:46:04Z 900e6776-c9d5-464b-820d-669dccd90ecc                                                                                                    
19 Litoria peronii 2010-02-23T08:46:49Z f8b8f786-a579-4f0c-a13f-2cdce9537c04    

Then getting associated media:

df |> 
  show_all_media() |>
  dplyr::filter(width > 1000) |> 
  collect_media(download_dir = "TEST", type = "thumbnail")

16 files were downloaded to /Users/wes186/Documents/Work/Development/AtlasOfLivingAustralia/Package_galah/galah/TEST
# A tibble: 16 × 13
   media_id                             mime_type  size_in_bytes date_uploaded       date_taken          height width creator        license       data_…¹ occur…² url   downl…³
   <chr>                                <chr>              <int> <chr>               <chr>                <int> <int> <chr>          <chr>         <chr>   <chr>   <chr> <chr>  
 1 df220f9c-ba53-4ee2-adc6-d5bc2919c317 image/jpeg        271384 2021-06-25 11:55:09 2021-06-25 11:55:09    768  1024 bpalmerau      "http://crea… dr1411  ""      http… /Users…
 2 218ef2ec-8bb7-4664-922c-de39031e0d86 image/jpeg        158380 2015-11-07 04:08:14 2015-11-07 04:08:14    830  1024 Robert Bender  ""            dr893   "e5aa8… http… /Users…
 3 44e386a3-842e-4191-a703-0d2df8942000 image/jpeg        123787 2014-05-20 12:02:41 2014-05-20 12:02:41    778  1024 Ken Walker     ""            dr893   "5f420… http… /Users…
 4 f90e4dff-2ecf-4468-a6af-db9e61e2e300 image/jpeg        158380 2015-10-17 04:07:58 2015-10-17 04:07:58    830  1024 Robert Bender  ""            dr893   "d8804… http… /Users…
 5 a8c93ae3-7a53-4af5-9897-47c9c218f1f2 image/jpeg        103675 2019-09-12 18:27:42 2019-09-12 18:27:42    770  1024 Ken Walker     "http://crea… dr1411  "ac3db… http… /Users…
 6 c59ee1db-c120-4ad2-9027-b6feac5dee5c image/jpeg        121007 2019-07-06 11:08:39 2019-07-06 11:08:39    778  1024 Ken Walker     "http://crea… dr1411  "ac3db… http… /Users…
 7 00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0 image/jpeg        689628 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 8 090f0525-3c2c-422e-b8f3-7c13a3b9319d image/jpeg        756318 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 9 3f389785-7ae0-4de6-b324-b99a56910fc9 image/jpeg        910591 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
10 734f4c29-1f2c-4cee-adb2-6f3468c00faf image/jpeg        700821 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
11 9b651b9b-5547-4ae1-bcb9-dab15ac86c29 image/jpeg        702069 2015-10-21 12:08:57 2015-10-21 12:08:57   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
12 ee84d419-80a5-4386-8cfe-02a88a53238b image/jpeg        659726 2019-09-05 23:16:50 2019-09-05 23:16:50   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
13 6873f70f-bba1-4bc3-a3a3-528426f5e319 image/jpeg        587538 2020-06-08 03:12:28 2020-06-08 03:12:28   1207  1811 davidcoleby    "http://crea… dr1411  "150bf… http… /Users…
14 003f0cd0-70bd-4b28-98ef-955eba0470de image/jpeg        833794 2020-06-08 03:11:00 2020-06-08 03:11:00   1366  2048 Arthur Chapman "http://crea… dr1411  "9f633… http… /Users…
15 2a6c5fc7-c3ac-4832-a572-cd248357ac47 image/jpeg        890271 2020-06-08 03:10:50 2020-06-08 03:10:50   1366  2048 Arthur Chapman "http://crea… dr1411  "9f633… http… /Users…
16 61015de5-662d-437c-b603-51247ed1c063 image/jpeg       1154857 2020-06-08 03:11:05 2020-06-08 03:11:05   1366  2048 Arthur Chapman "http://crea… dr1411  "9f633… http… /Users…
# … with abbreviated variable names ¹​data_resource_uid, ²​occurrence_id, ³​download_path

It would not be impossible to walk these changes back into atlas_media, but I think that splitting the metadata download from the media download is quite sensible.
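
To illustrate why the split is convenient, the download stage can be sketched as a small function that needs nothing but a metadata table. The helper `download_media_sketch`, its `fetch` argument, and the `.jpg` extension are all assumptions made for this example; this is not how `collect_media` is implemented.

```r
# Hypothetical helper for illustration only, not galah's `collect_media`.
# `fetch` is injectable so the loop can be exercised without network access.
download_media_sketch <- function(meta, download_dir,
                                  fetch = function(url, destfile) {
                                    utils::download.file(url, destfile,
                                                         mode = "wb", quiet = TRUE)
                                  }) {
  dir.create(download_dir, showWarnings = FALSE, recursive = TRUE)
  # Assumes JPEG files; a fuller version would derive the extension
  # from the `mime_type` column.
  meta$download_path <- file.path(download_dir, paste0(meta$media_id, ".jpg"))
  # Loop over the rows that survived any earlier filtering.
  for (i in seq_len(nrow(meta))) {
    fetch(meta$url[i], meta$download_path[i])
  }
  message(nrow(meta), " files were downloaded to ", download_dir)
  invisible(meta)
}
```

Because the function only sees rows that are passed to it, any filtering applied to the metadata beforehand directly reduces the number of files fetched.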

mjwestgate added a commit that referenced this issue Jul 21, 2022
mjwestgate added a commit that referenced this issue Jul 25, 2022
mjwestgate added a commit that referenced this issue Jul 27, 2022
- force `show_all_media` to only accept a `data.frame` (not a `vector`)
- append supplied columns in long-form to the output (as per original code)
- does not support `galah_select` or `galah_filter`
@mjwestgate
Collaborator

This is now possible using dplyr::filter on the tibble returned by atlas_media.
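
As a minimal sketch of that filtering step, the snippet below applies a file-size and width filter to a mock metadata table. The column names follow the output printed earlier in this thread and the values are copied from three of those rows. Base R `subset` is used so the example runs without extra packages; `dplyr::filter(meta, size_in_bytes < 10^6, width > 1000)` should give the same rows.

```r
# Mock metadata table standing in for the tibble returned by atlas_media.
meta <- data.frame(
  media_id = c(
    "218ef2ec-8bb7-4664-922c-de39031e0d86",
    "00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0",
    "61015de5-662d-437c-b603-51247ed1c063"
  ),
  size_in_bytes = c(158380L, 689628L, 1154857L),
  width = c(1024L, 2048L, 2048L),
  stringsAsFactors = FALSE
)

# Keep images smaller than 1 MB that are also wider than 1000 pixels.
small <- subset(meta, size_in_bytes < 10^6 & width > 1000)
```

Only the rows kept in `small` would then be passed on to the download step.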
