Verifying the DLK / CCS method for discovering truth concepts in language models

artkpv/C3S

CCS on compound sentences (C3S)

Repository for the publication on an unsupervised method for lie detection.

Finding internal knowledge representations inside transformer models without supervision is a challenging task that matters for scalable oversight and for mitigating the risk of deception. We test Contrast-Consistent Search (CCS) on the TruthfulQA dataset using compound sentences (conjunctions and disjunctions), each composed of several answers to a question, to see whether unsupervised probes work as well on them as on the simple statements they consist of. The goal is to improve unsupervised methods for discovering latent knowledge.
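As a rough illustration of the setup described above, the sketch below builds a conjunction and a disjunction from two answers to one question and evaluates the standard CCS objective (consistency plus confidence terms) on a pair of probe outputs. It is a minimal sketch, not this repository's code: the `Statement` class, the `compound` and `contrast_pair` helpers, the prompt template, and the toy question and answers are all assumptions made for illustration, and the probabilities are passed in by hand rather than produced by a trained probe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """A candidate answer with its ground-truth label (hypothetical helper)."""
    text: str
    is_true: bool


def compound(a: Statement, b: Statement, connective: str) -> Statement:
    """Join two answers; the label follows the logical connective."""
    if connective == "and":
        truth = a.is_true and b.is_true
    elif connective == "or":
        truth = a.is_true or b.is_true
    else:
        raise ValueError(f"unsupported connective: {connective!r}")
    return Statement(f"{a.text} {connective} {b.text}", truth)


def contrast_pair(question: str, statement: Statement) -> tuple[str, str]:
    """Build the (positive, negative) prompts for one statement.

    The template is an assumption; the actual prompts may differ.
    """
    base = f"Q: {question} A: {statement.text}."
    return f"{base} This is true.", f"{base} This is false."


def ccs_loss(p_pos: float, p_neg: float) -> float:
    """CCS objective for one contrast pair of probe probabilities."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
    confidence = min(p_pos, p_neg) ** 2         # discourage p(x+) = p(x-) = 0.5
    return consistency + confidence


# Toy data, not taken from TruthfulQA.
question = "What color is the sky on a clear day?"
blue = Statement("the sky is blue", True)
green = Statement("the sky is green", False)

conjunction = compound(blue, green, "and")  # false: one conjunct is false
disjunction = compound(blue, green, "or")   # true: one disjunct is true
positive_prompt, negative_prompt = contrast_pair(question, disjunction)
```

In a full pipeline the two prompts would be fed to the language model, hidden states would be extracted and normalized, and a probe would be trained to minimize the mean of `ccs_loss` over all pairs; none of that is shown here.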

See https://www.lesswrong.com/posts/Lgvw4rFsGcXoyYZbw/ccs-on-compound-sentences