
Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter. #526

Merged
merged 1 commit into main from issue-525
Jan 20, 2022

Conversation

ruebot
Member

@ruebot ruebot commented Jan 20, 2022

GitHub issue(s): #525

What does this Pull Request do?

Change crawl_date format to YYYYMMDDHHMMSS, update hasDate filter.
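
Since the new crawl_date carries second granularity, it round-trips cleanly through java.time. A minimal parsing sketch (illustration only, not part of this PR):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Parse the new YYYYMMDDHHMMSS crawl_date into a LocalDateTime.
val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")
val ts = LocalDateTime.parse("20201124212851", fmt) // 2020-11-24T21:28:51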

How should this be tested?

  • Build system
  • Tested locally:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.13)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val test = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/ars-cloud/in/14462/arcs",sc)
  .webpages()
  .select($"url", $"crawl_date")

// Exiting paste mode, now interpreting.

import io.archivesunleashed._
import io.archivesunleashed.udfs._
test: org.apache.spark.sql.DataFrame = [url: string, crawl_date: string]

scala> test.cache()
res0: test.type = [url: string, crawl_date: string]

scala> test.count()
res1: Long = 2811                                                              

scala> test.show(25)
+--------------------+--------------+
|                 url|    crawl_date|
+--------------------+--------------+
|https://www.youtu...|20201124212851|
|https://www.youtu...|20201124212903|
|https://www.youtu...|20201124212918|
|https://archivesu...|20201124212920|
|https://www.youtu...|20201124212930|
|https://archivesu...|20201224212956|
|https://www.ianmi...|20201224212643|
| https://schema.org/|20201224212809|
|https://www.ianmi...|20201224213124|
|https://www.ianmi...|20201224213146|
|https://www.ianmi...|20201224213212|
|https://www.ianmi...|20201224213230|
|https://www.ianmi...|20201224213300|
|https://www.ianmi...|20201224213319|
|https://www.ianmi...|20201224213335|
|https://www.ianmi...|20201224213353|
|https://www.ianmi...|20201224213425|
|https://www.ianmi...|20201224213443|
|https://www.youtu...|20201224213456|
|https://m.youtube...|20201224213516|
|https://www.ianmi...|20201224213546|
|https://www.ianmi...|20201224213627|
|https://www.ianmi...|20201224213701|
|https://www.ianmi...|20201224213727|
|https://www.ianmi...|20201224213757|
+--------------------+--------------+
only showing top 25 rows


scala> val date = Array("20201124212851")

scala> test.filter(hasDate($"crawl_date", lit(date))).count()
res4: Long = 1

scala> test.filter(!hasDate($"crawl_date", lit(date))).count()
res6: Long = 2810

scala> test.filter(hasDate($"crawl_date", lit(Array("20201224212.*")))).count()
res22: Long = 6

scala> test.filter(!hasDate($"crawl_date", lit(Array("20201224212.*")))).count()
res26: Long = 2805

scala> test.filter(hasDate($"crawl_date", lit(Array("2020.*")))).count()
res27: Long = 2625

scala> test.filter(!hasDate($"crawl_date", lit(Array("2020.*")))).count()
res28: Long = 186
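
The counts above line up with hasDate treating each supplied entry as a regular expression matched against the whole crawl_date string. A rough sketch of that semantics as a plain Spark UDF (an assumption for illustration, not the actual aut implementation):

import org.apache.spark.sql.functions.udf

// Sketch: keep a row when crawl_date fully matches any supplied pattern.
// (Assumed semantics; the real hasDate lives in io.archivesunleashed.udfs.)
val hasDateSketch = udf((crawlDate: String, dates: Seq[String]) =>
  dates.exists(pattern => crawlDate.matches(pattern)))

Under full-string matching, "20201124212851" matches the literal "20201124212851" and the pattern "2020.*", but a bare "2020" would not match.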

Additional Notes:

I can cut a release if y'all want, and I have updated documentation ready to push up once this is merged.

Δ docs/filters-df.md

@@ 38: WebArchive(sc, sqlContext, "/path/to/warcs") \

 ## Has Dates

-Filters or keeps all data that does or does not match the date…
+Filters or keeps all data that does or does not match the time…

 ### Scala DF

@@ 46: Filters or keeps all data that does or does not match the date(s) specified.

 import io.archivesunleashed._
 import io.archivesunleashed.udfs._

-val dates = Array("2008", "200908", "20070502")
+val dates = Array("2008.*", "200908.*", "20070502231159")

 RecordLoader.loadArchives("/path/to/warcs",sc)
   .all()

@@ 60: RecordLoader.loadArchives("/path/to/warcs",sc)

 from aut import *
 from pyspark.sql.functions import col

-dates = ["2008", "200908", "20070502"]
+dates = ["2008.*", "200908.*", "20070502231159"]

 WebArchive(sc, sqlContext, "/path/to/warcs") \
   .all() \
@ruebot ruebot requested a review from ianmilligan1 January 20, 2022 18:09
- Update hasDate filter to match patterns, since it previously matched only
  literals (see the sketch after this list)
- Resolves #525
- Update tests as required
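
The first bullet is the behavioral crux: entries were previously compared as exact strings, and are now evaluated as regular expressions. A hedged sketch of the before/after check (the "before" side is assumed from the commit message):

// Assumed old semantics: exact literal membership.
val matchedBefore = (d: String, dates: Seq[String]) => dates.contains(d)

// New semantics: full-string regex match against each pattern.
val matchedAfter = (d: String, dates: Seq[String]) => dates.exists(d.matches(_))

matchedAfter("20201124212851", Seq("2020.*"))  // true
matchedBefore("20201124212851", Seq("2020.*")) // false: "2020.*" is not an exact match
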
@codecov

codecov bot commented Jan 20, 2022

Codecov Report

Merging #526 (9f5a46b) into main (8104a65) will increase coverage by 0.07%.
The diff coverage is 87.09%.

@@             Coverage Diff              @@
##               main     #526      +/-   ##
============================================
+ Coverage     88.83%   88.91%   +0.07%     
  Complexity       57       57              
============================================
  Files            43       43              
  Lines          1012     1046      +34     
  Branches         85       86       +1     
============================================
+ Hits            899      930      +31     
- Misses           74       75       +1     
- Partials         39       41       +2     

@ruebot
Member Author

ruebot commented Jan 20, 2022

@ianmilligan1 you can ignore the codecov/patch check.

Member

@ianmilligan1 ianmilligan1 left a comment


Builds nicely locally and tested it out. 👍

@ianmilligan1 ianmilligan1 merged commit 73354e8 into main Jan 20, 2022
@ianmilligan1 ianmilligan1 deleted the issue-525 branch January 20, 2022 19:16
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Jan 20, 2022
Successfully merging this pull request may close these issues.

Include timestamp in crawl date