java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529

Closed
JakeBickUKGWA opened this issue Mar 29, 2022 · 1 comment

Describe the bug
I am working through the AUT walkthrough at https://aut.docs.archivesunleashed.org/docs/toolkit-walkthrough. I installed AUT using the instructions at https://github.com/archivesunleashed/docker-aut#build-and-run, on an Ubuntu-based EC2 instance.

I can run the first step ok and get the count of domains in your sample material (though it does give a few errors at the beginning).

But if I try the second step to extract text it just gives me this error message:

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:90)
at org.apache.spark.sql.catalyst.expressions.Literal$.$anonfun$create$2(literals.scala:152)
at scala.util.Failure.getOrElse(Try.scala:222)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:152)
at org.apache.spark.sql.functions$.typedLit(functions.scala:131)
at org.apache.spark.sql.functions$.lit(functions.scala:114)
... 59 elided

I've copied the full terminal content to the attached text file.

To Reproduce
Steps to reproduce the behavior:

  1. Start AUT (I use sudo docker run --rm -it -v "/home/jbickford/Desktop/AUTdata:/data" aut)
  2. In paste mode, run
import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .all()
  .keepValidPagesDF()
  .groupBy(extractDomain($"url").alias("domain"))
  .count()
  .sort($"count".desc)
  .show(10, false)
  3. This gives some errors, but as expected generates a table of the top domains in the sample collection
  4. Again in paste mode, run
import io.archivesunleashed._
import io.archivesunleashed.udfs._

val domains = Set("liberal.ca")

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", $"content")
  .filter(hasDomains($"domain", lit(domains)))
  .write.csv("/data/liberal-party-text")

This generates the java.lang.RuntimeException error mentioned above.

Expected behavior
As I understand it, AUT should generate a folder called liberal-party-text, containing extracted text files from the sample data.

Screenshots
Attached

Environment information

  • AUT version: I'm afraid I'm not sure, it's the version in the docker image in the walkthrough
  • OS: Ubuntu 20.04.4 LTS (in an EC2 instance)
  • Java version: OpenJDK 64-Bit Server VM, Java 11.0.14.1
  • Apache Spark version: 3.11
  • Apache Spark w/aut: sorry, I'm also unsure about this, I'm guessing it's determined by the docker image, but if not let me know
  • Apache Spark command used to run AUT: sudo docker run --rm -it -v "/home/jbickford/Desktop/AUTdata:/data" aut

AUTissue.txt

ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Mar 29, 2022
Member

ruebot commented Mar 29, 2022

@JakeBickUKGWA sorry about that, it was a documentation issue. I forgot to update the type used for the variable. It should be Array, not Set. The documentation has been updated: https://aut.docs.archivesunleashed.org/docs/toolkit-walkthrough#extracting-some-text
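
For reference, a sketch of what the second step from the report would look like with that change applied. The only difference is that `domains` is an `Array` rather than a `Set`; everything else is copied from the snippet in the report. It assumes the same Docker spark-shell session as the walkthrough, where `sc` and the `$"..."` column syntax are already in scope, and it has not been re-run here. The explicit `lit` import is an addition for clarity and should be redundant inside spark-shell.

```scala
import io.archivesunleashed._
import io.archivesunleashed.udfs._
// Redundant in spark-shell, added here only to make the origin of lit() explicit.
import org.apache.spark.sql.functions.lit

// Array instead of Set: lit() cannot build a Spark literal from a Scala Set,
// which is what raised "Unsupported literal type class ... Set$Set1".
val domains = Array("liberal.ca")

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", $"content")
  // Keep only rows whose extracted domain is listed in `domains`.
  .filter(hasDomains($"domain", lit(domains)))
  .write.csv("/data/liberal-party-text")
```

The stack trace in the report points at `functions.lit` falling through to `Literal.apply`, which rejects the `Set`; an `Array` of strings is a type that `lit` does accept, which matches the fix described above.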

ruebot closed this as completed Mar 29, 2022