Describe the bug
When extracting a list of URLs from an ARC file, the ARC filename (a filedesc:// record) appears in the list of URLs.
To Reproduce
Using the sample ARC found in our aut-resources repository, the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc).all()
  .select($"crawl_date", $"url")
  .show(10, false)
Produces the following output:
+----------+-----------------------------------------------------------------------------------------------+
|crawl_date|url                                                                                            |
+----------+-----------------------------------------------------------------------------------------------+
|20060622  |filedesc://ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc|
|20060622  |http://www.gca.ca/indexcms/?organizations&orgid=27                                             |
|20060622  |http://www.ppforum.com/en/speeches/index.asp?theme=all&year=2003                               |
|20060622  |http://www.nosharia.com/telgraaf,%20nedeland_files/image010.gif                                |
|20060622  |http://www.canadianlandmine.org/french/french/n1kd.cfm                                         |
|20060622  |http://canadianactionparty.ca/temp/Important_Memos/Some_facts_about_the_Census.doc             |
|20060622  |http://coat.ncf.ca/our_magazine/links/53/smith_a.jpg                                           |
|20060622  |http://communist-party.ca/calendar/cal_week.php?op=week&date=2006-08-18&catview=0              |
|20060622  |http://www.nawl.ca/ns/en/documents/Pub_Brief_Antiterror01_en.doc                               |
|20060622  |http://www.conservative.ca/EN/1018/                                                            |
+----------+-----------------------------------------------------------------------------------------------+
The same behaviour is not found in WARC files.
Expected behavior
The filedesc record probably shouldn't be in the results list here.
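As a stopgap, these rows could be dropped by URL prefix. The following is only a minimal sketch in plain Scala, not aut code: `UrlFilter`, `metadataPrefixes`, and `isMetadataRecord` are hypothetical names, and treating `dns:` as a second bookkeeping prefix is an assumption based on the fix's commit title.

```scala
// Hypothetical helper (not part of aut): flags ARC bookkeeping records by URL prefix.
object UrlFilter {
  // Assumption: filedesc:// marks the ARC file header record, dns: marks DNS lookup records.
  val metadataPrefixes: Seq[String] = Seq("filedesc://", "dns:")

  // True when the URL belongs to a bookkeeping record rather than crawled content.
  def isMetadataRecord(url: String): Boolean =
    metadataPrefixes.exists(prefix => url.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    // Sample values taken from the output above.
    val urls = Seq(
      "filedesc://ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc",
      "http://www.gca.ca/indexcms/?organizations&orgid=27",
      "http://www.conservative.ca/EN/1018/"
    )
    // Keep only the crawled URLs.
    urls.filterNot(isMetadataRecord).foreach(println)
  }
}
```

In the spark-shell session, the same idea could probably be applied to the DataFrame with `.filter(!$"url".startsWith("filedesc://"))` before `.show`, using Spark's `Column.startsWith`.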
Environment information
Spark shell started with --jars:
./bin/spark-shell --jars ~/dropbox/git/aut/target/aut-0.90.2-SNAPSHOT-fatjar.jar
Additional context
@ruebot
Filter or filedesc and dns records from arcs.
a392974
- Resolves #516 - add removeFiledesc method, and apply it - update tests
Filter or filedesc and dns records from arcs. (#517)
a6d3265