jbickford@ip-XXX-XX-XX-XXX:~$ docker --version
Docker version 20.10.12, build e91ed57
jbickford@ip-XXX-XX-XX-XXX:~$ sudo docker run --rm -it -v "/home/jbickford/Desktop/AUTdata/data" aut
[sudo] password for jbickford:
Sorry, try again.
[sudo] password for jbickford:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
tay22/03/23 09:11:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://05e5b3d391d7:4040
Spark context available as 'sc' (master = local[*], app id = local-1648026727460).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.14.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .all()
  .keepValidPagesDF()
  .groupBy(extractDomain($"url").alias("domain"))
  .count()
  .sort($"count".desc)
  .show(10, false)

// Exiting paste mode, now interpreting.

22/03/23 09:14:03 WARN SparkSession$Builder: Using an existing SparkSession; some spark core configurations may not take effect.
22/03/23 09:14:08 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

22/03/23 09:14:08 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
+---------------------+-----+
|domain               |count|
+---------------------+-----+
|equalvoice.ca        |4274 |
|liberal.ca           |1981 |
|policyalternatives.ca|588  |
|greenparty.ca        |535  |
|fairvote.ca          |442  |
|ndp.ca               |416  |
|davidsuzuki.org      |348  |
|canadiancrc.com      |88   |
|communist-party.ca   |39   |
|ccsd.ca              |22   |
+---------------------+-----+
only showing top 10 rows

import io.archivesunleashed._
import io.archivesunleashed.udfs._

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/data/*.gz", sc)
  .all()
  .keepValidPagesDF()
  .groupBy(extractDomain($"url").alias("domain"))
  .count()
  .sort($"count".desc)
  .show(10, false)

// Exiting paste mode, now interpreting.

java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
  at org.apache.hadoop.fs.Path.<init>(Path.java:134)
  at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
  at org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1248)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.SparkContext.withScope(SparkContext.scala:786)
  at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1236)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:105)
  ... 59 elided

scala> jbickford@ip-XXX-XX-XX-XXX:~$