ARC file name appearing in url list #516

Closed
ianmilligan1 opened this issue May 11, 2021 · 0 comments · Fixed by #517

Describe the bug
When extracting a list of URLs in an ARC file, the filename appears among the list of URLs.

To Reproduce
Using the sample ARC found in our aut-resources repository, the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc).all()
  .select($"crawl_date", $"url")
  .show(10, false)

Leads to this output:

+----------+-----------------------------------------------------------------------------------------------+
|crawl_date|url                                                                                            |
+----------+-----------------------------------------------------------------------------------------------+
|20060622  |filedesc://ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc|
|20060622  |http://www.gca.ca/indexcms/?organizations&orgid=27                                             |
|20060622  |http://www.ppforum.com/en/speeches/index.asp?theme=all&year=2003                               |
|20060622  |http://www.nosharia.com/telgraaf,%20nedeland_files/image010.gif                                |
|20060622  |http://www.canadianlandmine.org/french/french/n1kd.cfm                                         |
|20060622  |http://canadianactionparty.ca/temp/Important_Memos/Some_facts_about_the_Census.doc             |
|20060622  |http://coat.ncf.ca/our_magazine/links/53/smith_a.jpg                                           |
|20060622  |http://communist-party.ca/calendar/cal_week.php?op=week&date=2006-08-18&catview=0              |
|20060622  |http://www.nawl.ca/ns/en/documents/Pub_Brief_Antiterror01_en.doc                               |
|20060622  |http://www.conservative.ca/EN/1018/                                                            |
+----------+-----------------------------------------------------------------------------------------------+

The same behaviour is not found in WARC files.

Expected behavior
The filedesc:// entry is the ARC file's own header record rather than a crawled page, so it probably shouldn't be in the results list here.
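
Until this is handled inside AUT, one possible workaround is to drop any row whose URL uses the filedesc:// scheme. The sketch below shows the idea on a plain Scala collection so it can be pasted into any Scala REPL, including spark-shell. The helper name isFiledesc is hypothetical and not part of AUT, and the sample values are copied from the output above.

```scala
// Hypothetical helper (not an AUT method): true when a URL is an ARC
// file-description entry rather than a crawled resource.
def isFiledesc(url: String): Boolean = url.startsWith("filedesc://")

// Sample values taken from the output shown in this issue.
val urls = Seq(
  "filedesc://ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc",
  "http://www.gca.ca/indexcms/?organizations&orgid=27",
  "http://www.conservative.ca/EN/1018/"
)

// Keep only the entries that are not file-description records.
val pageUrls = urls.filterNot(isFiledesc)
pageUrls.foreach(println)
```

On the DataFrame from the script above, the same idea would presumably be a filter such as .filter(!$"url".startsWith("filedesc://")) placed before .show(10, false). That is a user-side workaround under these assumptions, not necessarily how a fix in AUT itself would be implemented.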

Environment information

  • AUT version: HEAD
  • OS: macOS 11.3.1
  • Java version: Java 11
  • Apache Spark version: 3.0.0
  • Apache Spark w/aut: --jars
  • Apache Spark command used to run AUT: ./bin/spark-shell --jars ~/dropbox/git/aut/target/aut-0.90.2-SNAPSHOT-fatjar.jar

Additional context
@ruebot

ruebot added a commit that referenced this issue May 12, 2021
- Resolves #516
- add removeFiledesc method, and apply it
- update tests
ianmilligan1 pushed a commit that referenced this issue May 12, 2021
- Resolves #516
- add removeFiledesc method, and apply it
- update tests
ruebot added the bug label May 12, 2021