Releases: archivesunleashed/aut
aut-0.80.0
Documentation
Release Notes
Closed issues:
- Broken link in documentation #476
- Improve udfs/package.scala test coverage #473
- Remove tabDelimit #471
- Remove Extract Entities #469
- PEP8 Naming - UDFs, App method names, DataFrame names, and filters. #468
- Python UDFs - class or not? #467
- Remove ExtractImageDetailsDF.scala #464
- github-site-deploy uses password based authentication which is being deprecated by GitHub #461
- Implement Python versions of Serializable APIs #410
- Implement Python versions of App utilities #409
- Implement Python versions of Matchbox utilities #408
- Improve TupleFormatter.scala test coverage #59
- Create tests for NERCombinedJson.scala #53
- Create tests for NER3Classifier.scala #52
- Create tests for ExtractEntities.scala #48
Merged pull requests:
- Remove RDD suffixes on file, class, and object names. #479 (ruebot)
- PEP8 Python app method names. #477 (ruebot)
- Move Python UDF methods out of their own class. #475 (ruebot)
- Add DataFrame udf tests. #474 (ruebot)
- Remove tabDelimit. #472 (ruebot)
- Remove NER functionality. #470 (ruebot)
- Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466 (ruebot)
- Remove ExtractImageDetailsDF; resolves #464. #465 (ruebot)
- Implement Scala Matchbox UDFs in Python. #463 (ruebot)
- Import clean-up for df package. #462 (ruebot)
aut-0.70.0
Documentation
Release Notes
Implemented enhancements:
- Update PlainTextExtractor to just extract text #452
- Migration of all RDD functionality over to DataFrames #223
Fixed bugs:
- DomainFrequencyExtractor should remove WWW prefix #456
Closed issues:
- For extractor (spark-submit) job, set Spark app name to be the extractor job name. #458
- Remove RDD options from app #449
- Add parquet as an app format option #448
- Add datathon derivatives to app (binary info, web pages, web graph) #447
- Update Java 8 instructions for MacOS #445
- Add spark-submit to README #444
Merged pull requests:
- [skip travis] README updates #460 (ruebot)
- Set spark-submit app name to be "aut - extractorName". #459 (ruebot)
- Add RemovePrefixWWWDF to DomainFrequencyExtractor. #457 (ruebot)
- Updating Java install instructions for MacOS, resolves #445 #455 (ianmilligan1)
- Add option to save to Parquet for app. #454 (ruebot)
- Update PlainTextExtractor to output a single column; text. #453 (ruebot)
- Add a number of additional app extractors. #451 (ruebot)
- Remove RDD option in app; DataFrame only now. #450 (ruebot)
- [skip-travis] Add spark-submit option to README; resolves #444. #446 (ruebot)
aut 0.60.0
Documentation
Release Notes
Implemented enhancements:
- Discussion: Restyle UDFs in the context of DataFrames #425
- Add alt text column to imageGraph (imageLinks) #420
- UDFs that filter on url should also filter on src #418
Fixed bugs:
- CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
- DomainGraphExtractor produces different output in RDD vs DF #436
- Command line app fails because of missing log4j configuration #433
Closed issues:
- Remove GraphXML and ExtractGraphX #442
- Use Monochromatic Ids instead of hash to produce network identifiers. #440
- Add graphml output to DomainGraphExtractor #435
- Add webgraph, imagegraph, webpages, etc. to command line app #431
- Rename imageLinks to imageGraph #419
Merged pull requests:
- Remove GraphX support; resolves #442. #443 (ruebot)
- Remove WriteGraph; resolves #439. #441 (ruebot)
- Add graphml output to CommandLineApp and DomainGraphExtractor. #438 (ruebot)
- Align RDD and DF output for DomainGraphExtractor. #437 (ruebot)
- Update log4j configuration to resolve #433. #434 (ruebot)
- Add imagegraph, and webgraph to command line app. #432 (ruebot)
- Tweak hasDate to handle Seq. #430 (ruebot)
- Restyle keep/discard filter UDFs in the context of DataFrames #429 (ruebot)
- Update Spark and Hadoop versions. #426 (ruebot)
- update for 'src' column #424 (SinghGursimran)
- [skip travis] Add pre-print link to README. #423 (ruebot)
- Add img alt text to imagegraph(); resolves #420. #422 (ruebot)
- Rename imageLinks to imageGraph; resolves #419 #421 (ruebot)
- Need --repositories flag with --packages. #417 (ruebot)
aut 0.50.0
Documentation
Release Notes
Implemented enhancements:
- Enhance keepValidPages #359
- Add discardLanguage filter #352
- Add crawl_date to binary DataFrames and imageLinks #413
Fixed bugs:
- textFiles does not filter properly #390
- DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362
Closed issues:
- .webpages() additional tokenized columns? #402
- Test and documentation inventory #372
- Missing doc comments #392
- Bug in ArcTest? Why run RemoveHTML? #369
- UDF CaMeL cASe consistency issues #368
- ExtractDomain or ExtractBaseDomain? #367
- Align DataFrame boilerplate in Python and Scala #366
- Create a ComputeSHA1 method #363
- Discussion: Should we align our Named Entity Recognition output with WANE format? #297
- DataFrame discussion: open thread #190
Merged pull requests:
- Clean up test descriptions, addresses #372. #416 (ruebot)
- Remaining Matchbox implementations for Scala #415 (SinghGursimran)
- Add crawl_date to binary DataFrames and imageLinks. #414 (ruebot)
- Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406 (ruebot)
- Use https for maven repo. #405 (ruebot)
- Test clean-up. #404 (ruebot)
- Add language detection column to webpages. #403 (ruebot)
- DataFrame Implementation - Serializable APIs #401 (SinghGursimran)
- Filter blank src/dest out of webgraph. #400 (ruebot)
- More df implementations #399 (SinghGursimran)
- Scala imports cleanup. #398 (ruebot)
- More Serializable APIs for DataFrames #396 (SinghGursimran)
- Update ExtractDateRDD test #395 (ruebot)
- Add doc comments for webpages and webgraph; resolves #392. #394 (ruebot)
- Add additional filters for textFiles; resolves #362. #393 (ruebot)
- API implementations for DataFrame #391 (SinghGursimran)
- Setup for Serializable APIs on DataFrames #389 (SinghGursimran)
- Add and update tests, resolve textFiles bug. #388 (ruebot)
- Dataframe matchbox Implementations #387 (SinghGursimran)
- Clean-up underscore import, and scalastyle warnings. #386 (ruebot)
- Rename pages() to webpages(). #384 (ruebot)
- More Data Frame Implementations + Code Refactoring #383 (SinghGursimran)
- Extract popular images - Data Frame implementation #382 (SinghGursimran)
- Append UDF with RDD or DF. #381 (ruebot)
- Matchbox utilities to DataFrames #380 (SinghGursimran)
- Rename DF functions to be consistent with Python DF functions. #379 (ruebot)
- Converting output of NER Classifier to WANE Format #378 (SinghGursimran)
- Finding Hyperlinks within Collection on Pages with Certain Keyword #377 (SinghGursimran)
- Update README.md #376 (lintool)
- Fix for Issue-368 #374 (SinghGursimran)
- [skip travis] update description. see https://github.com/archivesunle… #373 (ruebot)
- Various UDF implementation and cleanup for DF #370 (lintool)
- Update commons-compress to 1.19; CVE-2019-12402 #365 (ruebot)
- Add ComputeSHA1 method; resolves #363. #364 (ruebot)
- Align NER output to WANE format #361 (ruebot)
- Update keepValidPages to include a filter on 200 OK. #360 (ruebot)
- Update to Spark 2.4.4 #358 (ruebot)
- [skip travis] Update links #357 (ruebot)
- Improve test coverage. #354 (ruebot)
- Add discardLanguage filter to RecordLoader. #353 (ruebot)
aut 0.18.1
Fix for #407
aut 0.18.0
aut-0.18.0 (2019-08-21)
Implemented enhancements:
- Add method for unknown extensions in binary extractions #343
- Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
- Add filter/keep by http status to RecordLoader class #315
- Audio binary object extraction #307
- Video binary object extraction #306
- Powerpoint binary object extraction #305
- Doc binary object extraction #304
- Spreadsheet binary object extraction #303
- PDF binary object extraction #302
- Test aut with Apache Spark 2.4.0 #295
- Replace hashing of unique ids with .zipWithUniqueId() #243
- Integration of neural network models for image analysis #240
- More complete Twitter Ingestion #194
- Image Search Functionality #165
- feature request: log when loadArchives opens and closes warc files in a dir #156
Fixed bugs:
- DataFrame commands throwing java.lang.NullPointerException on example data #320
- Class issues when using aut-0.17.0-fatjar.jar #313
- Image extraction does not scale with number of WARCs #298
- ExtractDomain mistakenly checks source first then url #277
- Improve ExtractDomain to Better Isolate Domains #269
Closed issues:
- Inconsistency in ArchiveRecord.getContentBytes #334
- Rationalize computeHash and ComputeMD5 #333
- Test additional Java versions with TravisCI #324
- Remove Twitter/tweet analysis #322
- Trouble testing s3 connectivity #319
- Depfu Error: No dependency files found #309
- Strategy to deal with conflict between application and Spark distribution dependencies #308
- SaveImageTest.scala should delete saved image file #299
- Remove Deprecated ExtractGraph.scala file for next release. #291
- DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
- CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279
- Maven build warning during release #273
- Improve DataFrameLoader.scala test coverage #265
- Improve package.scala test coverage #263
- Discussion: Idiom for loading DataFrames #231
- DataFrame field names: open thread #229
- DataFrame performance comparison: Scala vs. Python #215
- TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
- feature request: ArchiveRecord.archiveFile #164
- feature request: possibility to query about the progress #162
- Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
- Create tests for ExtractGraph.scala #49
- Setup Victims #5
Merged pull requests:
- Update LICENSE and license headers. #351 (ruebot)
- Add binary extraction DataFrames to PySpark. #350 (ruebot)
- Add method for determining binary file extension #349 (jrwiebe)
- Add keep and discard by http status. #347 (ruebot)
- Add office document binary extraction. #346 (ruebot)
- Use version of tika-parsers without a classifier #345 (jrwiebe)
- Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344 (ruebot)
- Add Audio & Video binary extraction #341 (ruebot)
- Extract PDF #340 (jrwiebe)
- More scalastyle work; addresses #196. #339 (ruebot)
- Replace computeHash with ComputeMD5; resolves #333. #338 (ruebot)
- Update Tika to 1.22; address security alerts. #337 (ruebot)
- Tests #336 (ruebot)
- Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335 (ianmilligan1)
- Enable S3 access #332 (jrwiebe)
- Updates to pom following 0e701b2 #328 (ruebot)
- Move data frame fields names to snake_case. #327 (ruebot)
- Python formatting, and gitignore additions. #326 (ruebot)
- Test Java 8 & 11, and remove OracleJDK; resolves #324. #325 (ruebot)
- Remove Tweet utils. #323 (ruebot)
- Update to Spark 2.4.3 and update Tika to 1.20. #321 (ruebot)
- add image analysis w/ tensorflow #318 (h324yang)
- Makes ArchiveRecordImpl serializable #316 (jrwiebe)
- Resolve cobertura-maven-plugin class issue; resolves #313. #314 (ruebot)
- Update spark-core_2.11 to 2.3.1. #312 (ruebot)
- Log closing of ARC and WARC files, per #156 #301 (jrwiebe)
- Delete saved image file; resolves #299 #300 (jrwiebe)
- Remove Deprecated ExtractGraph app #293 (greebie)
- Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292 (greebie)
- Update license headers for #208. #290 (ruebot)
- Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289 (greebie)
- CVE-2018-11771 update #288 (ruebot)
- CVE-2017-17485 update; follow-on to #281. #287 (ruebot)
- Update Apache Tika - security vulnerabilities; resolves #131. #285 (ruebot)
- [skip travis]...
aut 0.17.0
Change Log
aut-0.17.0 (2018-10-04)
Implemented enhancements:
Fixed bugs:
- AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
- AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
- Improve ExtractDomain Normalization #239
- Twitter analysis is broken; see also: json4s/json4s#496 #197
- Prevent encoding errors in PySpark #122
Closed issues:
- Cannot skip bad record while reading warc file #267
- Why did Scalastyle not reject null values in TweetUtilTest #255
- Create UDF to combine basic text filtering features #253
- spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
- CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
- Extract images out of images DataFrame and store to disk #232
- Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
- DataFrames for image analysis #220
- The attempt to upgrade Spark version to 2.3.0 is not successful #218
- Convert nulls to Option(T) #212
- Bringing Scala DataFrames into PySpark #209
- What is AUT? #208
- Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
- Codify creation of standard derivatives into apps #195
- TweetUtils - support fulltext #192
- Combine UDFs into appropriate objects #187
- Register Scala functions for use in Pyspark #148
- PySpark performance bottlenecks: counting values #130
- Redesign of PySpark DataFrame interface for filtering #120
- Improve RecordLoader.scala test coverage #60
Merged pull requests:
- Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
- Update Bug report template. #268 (ruebot)
- ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
- Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
- Add support for full_text in tweets; resolve #192. #252 (ruebot)
- Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
- Remove stray characters from example commands. #250 (ruebot)
- Deal with final scalastyle assessments: Issue 212 #249 (greebie)
- Address main scalastyle errors - #196 #248 (greebie)
- Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
- Travis build fixes #244 (ruebot)
- Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
- Save images from dataframe to disk #234 (JWZ2018)
- Add missing dependencies in; addresses #227. #233 (ruebot)
- Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
- Add Extract Image Details API #226 (JWZ2018)
- Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
- Remove duplicate call of keepValidPages #224 (JWZ2018)
- Extract Image Links DF API + Test #221 (JWZ2018)
- Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
- Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
- Create issue templates #216 (ruebot)
- Exposing Scala DataFrames in PySpark #214 (lintool)
- Update project description; resolves #208. #211 (ruebot)
- Initial DataFrames merge #210 (lintool)
- Add more instructions on how to use things to the README. #207 (ruebot)
aut 0.16.0
aut 0.15.0
aut-0.15.0 (2018-04-11)
Implemented enhancements:
- Clean-up scaladoc comments #184
Closed issues:
- Rename package io.archivesunleashed.io #188
- Major Refactoring: RecordRDD #180
- Major refactoring: matchbox cleanup #179
- Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178
Merged pull requests:
- Improve and clean-up Scaladocs; resolves #184 #193 (ruebot)
- Major refactoring of package structure #189 (lintool)
- make ArchiveRecord a trait #186 (helgeho)
This Change Log was automatically generated by github_changelog_generator