The analyser introduces an +Ex/V tag and the tokeniser does not remove the “+” sign before tag #4

Open
rueter opened this issue May 12, 2024 · 8 comments

@rueter
Member

rueter commented May 12, 2024

cd lang-mrj
make distclean
./autogen.sh && ./configure --enable-tokenisers --enable-morpher 
make 

For some odd reason the analyzer introduces an +Ex/... tag, which is something lang-sms, for example, does not do. +Ex/... tags are brought in by the tokeniser.

hfst-lookup src/fst/analyser-gt-norm.hfstol 
> ӹлӹмӹжӹм
ӹлӹмӹжӹм	ӹлӓш+Ex/V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC	0,000000

The tokeniser cannot handle the previously introduced +Ex/V tag. Should the +Ex/V tag be showing up in the analyzer at all?

echo 'ӹлӹмӹжӹм' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<ӹлӹмӹжӹм>"
	"ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC <W:0.0>
@Trondtr
Contributor

Trondtr commented May 13, 2024

The tag is not declared in root.lexc; that is the problem.
I have declared it now -- git pull.
I did it for mrj and mns; there may be other languages where it is needed.

@rueter
Member Author

rueter commented May 13, 2024

> The tag is not declared in root.lexc, that is the problem. I did it now -- git pull. I did it for mrj and mns, there may be other lgs where it is needed.

This does NOT SOLVE the issue in mrj, where the result of:

echo 'ӹлӹмӹжӹм' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

is still:

"<ӹлӹмӹжӹм>"
	"ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC <W:0.0>

@Trondtr
Contributor

Trondtr commented May 13, 2024

OK, this seems to be the thing: the filter for removing the tag exists, but it is not added to the Makefile. I will have a look.

uit-mac-443 lang-mns (main)$ grep "rename-POS_before_Der-tags" ../lang-sme/src/fst/Makefile.am
				filters/rename-POS_before_Der-tags.hfst        \
		.o. @\"filters/rename-POS_before_Der-tags.hfst\"      \
					filters/rename-POS_before_Der-tags.%      \
			.o. @\"filters/rename-POS_before_Der-tags.$*\"      \
					filters/rename-POS_before_Der-tags.%      \
			.o. @\"filters/rename-POS_before_Der-tags.$*\"      \
				filters/rename-POS_before_Der-tags.hfst      \
		.o. @\"filters/rename-POS_before_Der-tags.hfst\"      \
		.o. @\"filters/rename-POS_before_Der-tags.hfst\"                  \
uit-mac-443 lang-mns (main)$ grep "rename-POS_before_Der-tags" src/fst/Makefile.am
(nothing)
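
The same check can be run over every language checkout at once. The following is a minimal sketch, assuming the language repositories sit side by side as `lang-*` directories (as the `../lang-sme` path above suggests); the function name `languages_missing_filter` is made up for illustration and is not part of the GiellaLT infrastructure.

```python
from pathlib import Path

# Filter name taken from the grep commands in this thread.
FILTER_NAME = "rename-POS_before_Der-tags"


def languages_missing_filter(root, filter_name=FILTER_NAME):
    """Return the lang-* directories whose src/fst/Makefile.am never mentions the filter."""
    missing = []
    for lang_dir in sorted(Path(root).glob("lang-*")):
        makefile = lang_dir / "src" / "fst" / "Makefile.am"
        if not makefile.is_file():
            continue  # no src/fst/Makefile.am here, so nothing to check
        text = makefile.read_text(encoding="utf-8", errors="replace")
        if filter_name not in text:
            missing.append(lang_dir.name)
    return missing
```

Calling `languages_missing_filter("..")` from inside one language directory would list the sibling languages where the filter is not mentioned at all. A plain substring match only shows whether the filter is referenced, not whether it is composed into every analyser that needs it.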

@Trondtr
Contributor

Trondtr commented May 13, 2024

Hmm, it wasn't that easy. The filter was missing in the mns directory, but not in the mrj one:

uit-mac-443 lang-mrj (main)$ grep "rename-POS_before_Der-tags" src/fst/Makefile.am
				filters/rename-POS_before_Der-tags.hfst        
		   @\"filters/rename-POS_before_Der-tags.hfst\"      \
				filters/rename-POS_before_Der-tags.hfst        
		   @\"filters/rename-POS_before_Der-tags.hfst\"      \
				filters/rename-POS_before_Der-tags.hfst        
		   @\"filters/rename-POS_before_Der-tags.hfst\"      \
					filters/rename-POS_before_Der-tags.$(1) 
			   @\"filters/rename-POS_before_Der-tags.$(1)\" \

It thus seems I am not on the right track after all. Stay tuned.

@snomos
Member

snomos commented May 13, 2024

> The tag is not declared in root.lexc, that is the problem. I did it now -- git pull. I did it for mrj and mns, there may be other lgs where it is needed.

The tag is automatically created, and should not be added to root.lexc.

@snomos
Member

snomos commented May 13, 2024

cd lang-mrj
make distclean
./autogen.sh && ./configure --enable-tokenisers --enable-morpher 
make 

> For some odd reason the analyzer introduces and +Ex/... tag, which is something lang-sms, for example, does not do. +Ex/... tags are brought in by the tokeniser.

lang-sms should do it, as should all languages with a productive derivational system. It has to be added manually for each language, though, IIRC.

The change from POStag to Ex/POStag is done to avoid issues with disambiguation: CG does not care about tag positions, so if a tag string contains first a +V and then an +A tag, both rules for verbs and adjectives will be triggered. By automatically changing all non-final POS tags to the +Ex/xxx format, only the POS tag of the last derivation will be considered by the CG rules, which is exactly what you want in 99% of the cases.
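
The renaming described above can be illustrated outside HFST. This is a minimal sketch in plain Python, not the actual `rename-POS_before_Der-tags` filter (which is a transducer): the `POS_TAGS` set is a made-up inventory, and the input string with a bare `+V` is an assumed pre-filter form of the analysis quoted in this issue.

```python
# Hypothetical POS inventory, for illustration only.
POS_TAGS = {"N", "V", "A", "Adv", "Pron", "Num"}


def rename_nonfinal_pos(analysis):
    """Rename every POS tag except the last one to Ex/POS."""
    lemma, *tags = analysis.split("+")
    pos_positions = [i for i, tag in enumerate(tags) if tag in POS_TAGS]
    for i in pos_positions[:-1]:
        tags[i] = "Ex/" + tags[i]
    return "+".join([lemma, *tags])


def pos_tags_seen(analysis):
    """POS tags visible to a position-insensitive matcher such as a CG rule."""
    return set(analysis.split("+")[1:]) & POS_TAGS


before = "ӹлӓш+V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC"
after = rename_nonfinal_pos(before)
print(after)
print(pos_tags_seen(before), pos_tags_seen(after))
```

Before the renaming, both `V` and `A` are visible as POS tags, so verb rules and adjective rules would both fire. Afterwards only `A` is, and the verb is still recoverable as `Ex/V`.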

hfst-lookup src/fst/analyser-gt-norm.hfstol 
> ӹлӹмӹжӹм
ӹлӹмӹжӹм	ӹлӓш+Ex/V+Der+Der/мЫ+Pass+Prc+A+Sg+Acc+PxSg3+So/PC	0,000000

> The tokeniser cannot handle the previously introduced +Ex/V tag. Should the +Ex/V tag be showing up in the analyzer at all?

This is a separate issue:

echo 'ӹлӹмӹжӹм' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<ӹлӹмӹжӹм>"
	"ӹлӓш"+Ex/V Der Der/мЫ Pass Prc A Sg Acc PxSg3 So/PC <W:0.0>

All tags should automatically be converted to the CG format, where each + is replaced with a space. Since this does not happen, it might be that the +Ex/V tag is not a real tag (a multichar symbol), but just a string of individual letters. I would try to figure out exactly where and what is converting the +V to +Ex/V, and see if there is a bug there somewhere.
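
The difference between a multichar symbol and a string of individual letters can be sketched in plain Python. This illustrates the hypothesis only and is not how HFST is implemented: the greedy longest-match function `segment` and the small symbol sets are assumptions made for the example.

```python
def segment(analysis, multichar_symbols):
    """Split a string into symbols, preferring the longest declared multichar symbol."""
    by_length = sorted(multichar_symbols, key=len, reverse=True)
    symbols = []
    i = 0
    while i < len(analysis):
        match = next((s for s in by_length if analysis.startswith(s, i)), None)
        if match is None:
            match = analysis[i]  # undeclared material falls back to single characters
        symbols.append(match)
        i += len(match)
    return symbols


declared = {"+V", "+Der", "+A"}
print(segment("a+Ex/V+Der+A", declared))
print(segment("a+Ex/V+Der+A", declared | {"+Ex/V"}))
```

With `+Ex/V` undeclared, the tag comes out as the five separate symbols `+`, `E`, `x`, `/`, `V`, so a rule that rewrites the symbol `+Ex/V` never sees it. Once declared, it is a single symbol like `+Der`.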

@flammie
Contributor

flammie commented Aug 7, 2024

The tokeniser-disamb-gt-desc tokeniser uses the tags from analyser-disamb-gt-desc to generate the relabelling rules in tools/tokenisers/filters/, and analyser-disamb-gt-desc does not contain +Ex tags.
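
Why a missing tag in the source analyser leaves the `+` in place can be sketched as follows. This is a plain-Python analogy under stated assumptions, not the real relabelling step in tools/tokenisers/filters/: the helper names `build_relabel_rules` and `to_cg` are hypothetical, and the tag list is copied from the analysis quoted in this issue.

```python
def build_relabel_rules(tag_inventory):
    """Map each '+Tag' found in the source analyser to its CG form ' Tag'."""
    return {tag: " " + tag[1:] for tag in tag_inventory if tag.startswith("+")}


def to_cg(lemma, tags, rules):
    """Render one reading; tags without a rule pass through unchanged."""
    return '"' + lemma + '"' + "".join(rules.get(tag, tag) for tag in tags)


tags = ["+Ex/V", "+Der", "+Der/мЫ", "+Pass", "+Prc", "+A",
        "+Sg", "+Acc", "+PxSg3", "+So/PC"]

# Inventory taken from an analyser that has no +Ex tags (the reported case).
rules_without_ex = build_relabel_rules(set(tags) - {"+Ex/V"})
print(to_cg("ӹлӓш", tags, rules_without_ex))

# Inventory taken from an analyser where the renaming has already been applied.
rules_with_ex = build_relabel_rules(set(tags))
print(to_cg("ӹлӓш", tags, rules_with_ex))
```

The first call reproduces the reading reported above, with `+Ex/V` glued to the lemma. The second gives every tag its leading space.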

@snomos
Member

snomos commented Aug 8, 2024

@rueter The solution is thus to ensure that the tag renaming script is also applied to the analyser-disamb-gt-desc file. You will probably have to add some local, language-specific compilation steps for that to happen. The same changes should also be used for other analysers; see how this is done in the Sámi languages.
