Upgrade to Lucene8 #679

Closed
lintool opened this issue Jun 5, 2019 · 6 comments

@lintool
Member

lintool commented Jun 5, 2019

Next decision we need to make... upgrade to Lucene8? It's a matter of when.

Lucene8 promises to be faster: https://github.com/castorini/anserini/blob/master/docs/lucene7-vs-lucene8.md

However, Lucene8 yields a whole slew of changes to regression numbers... and will likely break a bunch of things.

My recommendation is to rip the bandaid off, accept that things are going to be broken for a bit, and then gradually fix. EMNLP and CIKM deadlines just passed; WSDM isn't for a while; ACL is far off in the distance.

The direct effects I can foresee are:

  • Ongoing MS MARCO experiments (affecting @rodrigonogueira4 and @Victor0118)
  • Disruption for SIGIR demos (affecting @r-clancy @zeynepakkalyoncu @ljj7975)
  • OSIRRC can use the v0.5.0 release, so no effect here.
  • BERTserini needs to be retuned (affecting @Victor0118 @LuchenTan); but NAACL demo just passed.
  • Birch effectiveness numbers might change (but for OSIRRC can use v0.5.0 release) - affecting @zeynepakkalyoncu and @Victor0118
  • TREC DL is a bit off, so we can enter with updated version.
  • @emmileaf has a lot of refactoring planned.
  • Lucene8 upgrade will actually make ES integration easier, I think? (@charW and @r-clancy)

Anything else I missed? Downside of a large shared code-base used by the entire group... lots of effects...

cc @Peilin-Yang in case he has an opinion?

Thoughts, everyone? 👍 or 👎 or 🤷‍♂ or 🤷‍♀ ?

@ryan-clancy
Member

+1 for ripping the band-aid off, seems like a good time to do it.

  • SIGIR demos - we can use the 0.5.0 release, as the new index type in Lucene 8 will likely affect the results
  • anserini-docker is now using 0.5.0
  • ElasticSearch - this should let us use the most recent ES version rather than an older one

@charW
Member

charW commented Jun 5, 2019

+1. Using ES 7+ (based on Lucene 8) right now would be pretty cool.

@lintool lintool self-assigned this Jun 6, 2019
@rodrigonogueira4
Member

+1. As @lintool mentioned, now seems a good time because there is no deadline in the near future.

@zeynepakkalyoncu
Member

+1. I agree it's a good time to introduce a considerable change so that we can work on any issues until the next deadline.

@lintool
Member Author

lintool commented Jun 8, 2019

Update: I now have all our existing regressions passing on Lucene8, in a branch called lucene8:

[o] nohup python src/main/python/run_regression.py --index --collection disk12 >& log.disk12 &
[o] nohup python src/main/python/run_regression.py --index --collection robust04 >& log.robust04 &
[o] nohup python src/main/python/run_regression.py --index --collection robust05 >& log.robust05 &
[o] nohup python src/main/python/run_regression.py --index --collection core17 >& log.core17 &
[o] nohup python src/main/python/run_regression.py --index --collection core18 >& log.core18 &

[o] nohup python src/main/python/run_regression.py --index --collection wt10g >& log.wt10g &
[o] nohup python src/main/python/run_regression.py --index --collection gov2 >& log.gov2 &

[o] nohup python src/main/python/run_regression.py --index --collection car17v1.5 >& log.car17v1.5 &
[o] nohup python src/main/python/run_regression.py --index --collection car17v2.0 >& log.car17v2.0 &

[o] nohup python src/main/python/run_regression.py --index --collection mb11 >& log.mb11 &
[o] nohup python src/main/python/run_regression.py --index --collection mb13 >& log.mb13 &

[o] nohup python src/main/python/run_regression.py --index --collection cw09b >& log.cw09b &
[o] nohup python src/main/python/run_regression.py --index --collection cw12b13 >& log.cw12b13 &
[o] nohup python src/main/python/run_regression.py --index --collection cw12 >& log.cw12 &
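The command list above could also be driven from a small script. This is only a sketch: the script path, the `--index` and `--collection` flags, and the collection names are copied from the commands in this comment, while the helper names `regression_command` and `launch_all` are hypothetical and not part of Anserini. It does not replicate `nohup`, and it defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Collection names copied from the commands listed in this thread.
COLLECTIONS = [
    "disk12", "robust04", "robust05", "core17", "core18",
    "wt10g", "gov2",
    "car17v1.5", "car17v2.0",
    "mb11", "mb13",
    "cw09b", "cw12b13", "cw12",
]

# Script path as it appears in the thread.
SCRIPT = "src/main/python/run_regression.py"


def regression_command(collection):
    """Build the argument list for one regression run (hypothetical helper)."""
    return ["python", SCRIPT, "--index", "--collection", collection]


def launch_all(collections=COLLECTIONS, dry_run=True):
    """Start one process per collection, logging to log.<collection>.

    With dry_run=True (the default) nothing is executed; the commands
    are only printed so the list can be checked first.
    """
    procs = []
    for name in collections:
        cmd = regression_command(name)
        if dry_run:
            print(shlex.join(cmd), ">&", f"log.{name}")
            continue
        # Send stdout and stderr to the same file, like `>& log.<name>`.
        with open(f"log.{name}", "w") as log:
            procs.append(
                subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            )
    return procs


if __name__ == "__main__":
    launch_all(dry_run=True)
```

Calling `launch_all(dry_run=False)` would start all runs at once, which may be too much for a single machine; passing a shorter `collections` list is one way to run them in batches.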

My proposal is to abandon the JDIQ regressions, because (1) the DRF model that we tested is obsolete (#586) and (2) we need better tuning scripts, improvements on the SIGIR Forum tuning experiments. By "abandon" I mean we're not going to keep these regression numbers up to date any more.

Two more things I want to do: regressions for MS MARCO passage (#690) and doc (#691) - the reason is that I want to compare Lucene7 and Lucene8.

Otherwise, I think we're good to go for merge.

@lintool
Member Author

lintool commented Jun 12, 2019

🎉 Lucene 8 has dropped!
