Loading fastText models using only bin file #1341
Merged
34 commits
7759a95
french wiki issue resolved
c12b4fa
Merge branch 'develop' into french
prakhar2b 8025710
bin and vec mismatch handled
prakhar2b 7ee83d9
updating with lastest codes and resolving conflicts
041a6e9
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
22c6710
added test from bin only loading
61be613
[WIP] loading bin only
e11ac44
word vec from its ngrams
a63a3bc
[WIP] word vec from ngrams
f80410f
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
454d74e
[WIP] getting syn0 from all n-grams
e6b0d8b
[TDD] test comparing word vector from bin_only and default loading
9b03ea3
cleaned up test code
c496be9
added docstring for bin_only
2c4a8dd
Merge branch 'ft_oov_fix' of https://github.com/jayantj/gensim into f…
d2ab903
resolved wiki.fr issue
82507d1
pep8 fixes
c44b958
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
0fc1159
default bin file loading only
f421b05
logging info modified plus changes a/c review
68ec73b
removed unused code in fasttext.py
f7b372e
removed unused codes and vec files from test
5f7fe02
added lee_fasttext vec files again
8bd56cf
re-added removed files and unused codes
b916187
added file name in logging info
1a0bfc0
removing unused load_word2vec_format code
98e0287
updated logging info and comments
f3d2032
input file name with or without .bin both accepted
bd7e7f6
resolved typo mistake
800cd01
test for file name
a15233a
minor change to input filename handling in ft wrapper
jayantj 431aebf
changes to logging and assert messages, pep8 fixes
jayantj e52fee4
removes redundant .vec files
jayantj cebb3fc
fixes utf8 bug in flake8_diff.sh script
jayantj
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,172 @@ | ||
171 2 | ||
ji -0.79132 1.9605 | ||
kter� -0.90811 1.6411 | ||
jen -0.91547 2.0157 | ||
podle -0.64689 1.6221 | ||
zde -0.79732 2.4019 | ||
u� -0.69159 1.7167 | ||
b�t -0.455 1.3266 | ||
v�ce -0.75901 1.688 | ||
bude -0.71114 2.0771 | ||
ji� -0.73027 1.267 | ||
ne� -0.97888 1.8332 | ||
v�s -0.72803 1.6653 | ||
by -0.75761 1.9683 | ||
kter� -0.68791 1.6069 | ||
co -1.0059 1.6869 | ||
nebo -0.94393 1.9611 | ||
ten -0.71975 2.124 | ||
tak -0.80566 2.0783 | ||
m� -0.83065 1.3732 | ||
p�i -0.62158 1.8313 | ||
od -0.44113 1.7755 | ||
po -0.7059 2.2615 | ||
tipy -0.60682 1.7247 | ||
je�t� -0.68854 1.7517 | ||
a� -0.63201 1.4618 | ||
bez -0.52021 1.4513 | ||
tak� -0.67762 1.8138 | ||
pouze -0.62611 1.82 | ||
prvn� -0.42235 1.6216 | ||
va�e -0.7407 1.5659 | ||
kter� -0.70914 1.7359 | ||
n�s -0.38286 1.6016 | ||
nov� -0.83421 1.7609 | ||
jsou -0.82699 1.9694 | ||
pokud -0.35516 1.5075 | ||
m��e -0.78928 1.6357 | ||
strana -0.57276 1.4149 | ||
jeho -0.78568 2.0226 | ||
sv� -0.44488 1.459 | ||
jin� -0.90751 1.9602 | ||
zpr�vy -0.90152 1.9703 | ||
nov� -0.78853 1.8593 | ||
nen� -0.63949 1.5191 | ||
tomu -0.68126 1.8729 | ||
ona -0.74442 1.825 | ||
ono -0.78171 1.9268 | ||
oni -0.64023 2.0525 | ||
ony -0.78142 1.7097 | ||
my -0.61062 1.8857 | ||
vy -0.9356 1.8875 | ||
j� -0.44615 0.92715 | ||
m� -0.73676 1.4089 | ||
mne -0.71006 1.7072 | ||
jemu -0.92237 2.1452 | ||
on -0.71417 1.9224 | ||
t�m -0.65242 1.8779 | ||
t�mu -0.83376 2.054 | ||
n�mu -0.79287 1.8645 | ||
n�mu� -0.51786 1.7297 | ||
jeho� -0.88721 1.7431 | ||
j�� -0.12627 0.68014 | ||
jeliko� -0.61809 1.7576 | ||
je� -0.8843 1.6723 | ||
jako� -0.94336 1.827 | ||
na�e� -0.76919 1.8106 | ||
ze -0.8277 2.0542 | ||
jak -0.97146 1.9164 | ||
dal� -0.5719 1.5148 | ||
ale -0.79733 1.8867 | ||
si -0.61439 1.7134 | ||
se -0.80843 1.8957 | ||
ve -0.7186 1.7891 | ||
to -0.84494 2.3933 | ||
jako -1.1045 2.2656 | ||
za -0.7136 1.9602 | ||
zp�t -0.79965 1.6329 | ||
jejich -0.49038 1.6366 | ||
do -0.69806 1.8364 | ||
pro -0.7878 2.2066 | ||
je -1.1291 3.0005 | ||
na -1.0203 2.4399 | ||
atd -0.70418 1.7405 | ||
atp -0.69278 1.5772 | ||
jakmile -0.87231 1.6896 | ||
p�i�em� -0.64617 1.4417 | ||
j� -0.7135 1.5517 | ||
n�m -0.42164 1.7603 | ||
jej -0.77603 1.9544 | ||
zda -0.76742 2.0163 | ||
pro� -0.47241 1.7053 | ||
m�te -0.75963 1.9814 | ||
tato -0.64318 2.0382 | ||
kam -0.45101 1.498 | ||
tohoto -0.73702 1.8305 | ||
kdo -0.80535 1.8551 | ||
kte�� -0.72498 1.6669 | ||
mi -0.46791 1.7784 | ||
tyto -0.50319 1.7659 | ||
tom -0.59138 1.8657 | ||
tomuto -0.74312 1.7725 | ||
m�t -0.27199 1.1315 | ||
nic -0.56441 1.8591 | ||
proto -0.6649 1.946 | ||
kterou -0.84109 1.7498 | ||
byla -0.58737 1.941 | ||
toho -0.76081 1.8002 | ||
proto�e -0.55749 1.6686 | ||
asi -0.51689 1.7079 | ||
bude� -0.55392 1.6052 | ||
s -0.74207 1.8989 | ||
k -0.61082 2.079 | ||
o -0.76465 1.8956 | ||
i -0.85412 1.6611 | ||
u -0.68535 1.5332 | ||
v -0.73033 1.3855 | ||
z -0.60751 1.9108 | ||
dnes -0.6001 1.7531 | ||
cz -0.59754 1.4239 | ||
t�mto -0.69011 1.6643 | ||
ho -0.55961 1.6968 | ||
budem -0.54027 1.7894 | ||
byli -0.60956 1.793 | ||
jse� -0.63127 1.5972 | ||
m�j -0.48904 1.2814 | ||
sv�m -0.48494 1.8751 | ||
ta -0.78131 2.4286 | ||
tomto -0.60948 1.7083 | ||
tohle -0.74747 1.7907 | ||
tuto -0.74687 1.9464 | ||
neg -0.60997 1.7777 | ||
pod -0.49619 1.914 | ||
t�ma -0.55525 1.6668 | ||
mezi -0.46979 1.3583 | ||
p�es -0.5712 1.9908 | ||
ty -0.78637 2.2804 | ||
pak -0.60084 1.7026 | ||
v�m -0.48545 1.4611 | ||
ani -0.65672 1.7897 | ||
kdy� -0.42318 1.4884 | ||
v�ak -0.60908 1.6867 | ||
�i -0.36843 1.7586 | ||
jsem -0.54047 1.827 | ||
tento -0.64813 1.9799 | ||
�l�nku -0.65578 1.9129 | ||
�l�nky -0.55868 1.8642 | ||
aby -0.80989 1.8384 | ||
jsme -0.60673 1.843 | ||
p�ed -0.53861 2.0502 | ||
pta -0.49464 1.714 | ||
a -0.63056 2.2477 | ||
aj -0.62546 1.6357 | ||
na�i -0.5915 1.6066 | ||
napi�te -0.50964 1.777 | ||
re -0.95733 1.9544 | ||
co� -0.54673 1.6466 | ||
t�m -0.70952 1.8565 | ||
tak�e -0.55439 1.8013 | ||
sv�ch -0.36878 1.4883 | ||
jej� -0.7694 1.6612 | ||
sv�mi -0.63149 2.1581 | ||
jste -0.68444 2.0978 | ||
byl -0.57205 1.7836 | ||
tu -0.88384 2.2256 | ||
tedy -0.62474 2.0469 | ||
teto -0.63187 1.884 | ||
bylo -0.56362 2.0282 | ||
kde -0.7308 2.0316 | ||
ke -0.60918 1.9317 | ||
prav� -0.52626 1.9058 | ||
nad -0.54689 1.8666 | ||
nejsou -0.66814 1.8323 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,172 @@ | ||
171 2 | ||
ji -1.5308 2.0551 | ||
který -0.99211 1.4997 | ||
jen -1.1228 1.3667 | ||
podle -1.1469 1.4473 | ||
zde -1.0191 1.4011 | ||
už -0.91921 1.3531 | ||
být -1.0086 1.4582 | ||
více -1.1058 1.3376 | ||
bude -1.2032 1.7383 | ||
již -1.3136 1.4792 | ||
než -1.0664 1.6635 | ||
vás -1.1113 1.5703 | ||
by -1.1698 1.966 | ||
které -1.1295 1.6275 | ||
co -0.93518 1.1776 | ||
nebo -1.0791 1.5071 | ||
ten -1.1881 1.415 | ||
tak -1.4548 1.8457 | ||
má -1.0658 1.5255 | ||
při -1.3464 1.6107 | ||
od -0.79486 1.5585 | ||
po -1.2758 1.9186 | ||
tipy -0.69335 1.0799 | ||
ještě -0.87116 1.1618 | ||
až -1.2688 1.6518 | ||
bez -0.99627 1.423 | ||
také -1.141 1.4808 | ||
pouze -0.94181 1.4076 | ||
první -1.1166 1.5035 | ||
vaše -0.9672 1.4975 | ||
která -1.1102 1.5806 | ||
nás -1.1328 1.5253 | ||
nový -0.85553 1.1462 | ||
jsou -1.0792 1.8008 | ||
pokud -1.0427 1.3178 | ||
může -1.1269 1.419 | ||
strana -0.84973 1.1957 | ||
jeho -1.1644 1.5879 | ||
své -1.0546 1.6185 | ||
jiné -0.95046 1.2816 | ||
zprávy -0.88762 1.3374 | ||
nové -1.0588 1.619 | ||
není -1.0321 1.5566 | ||
tomu -1.0753 1.5211 | ||
ona -1.21 1.6992 | ||
ono -1.0733 1.6574 | ||
oni -1.1153 1.643 | ||
ony -1.0926 1.5244 | ||
my -0.92689 1.6378 | ||
vy -1.3708 1.8 | ||
jí -1.205 1.6606 | ||
mě -0.96436 1.4713 | ||
mne -1.0956 1.6333 | ||
jemu -1.1181 1.4661 | ||
on -1.0062 1.4124 | ||
těm -0.90732 1.2586 | ||
těmu -0.90621 1.4096 | ||
němu -1.0823 1.4396 | ||
němuž -1.0786 1.3892 | ||
jehož -1.1649 1.4418 | ||
jíž -1.0574 1.6338 | ||
jelikož -1.0449 1.3625 | ||
jež -1.2657 1.7032 | ||
jakož -1.3373 1.6112 | ||
načež -1.0127 1.3696 | ||
ze -1.1784 1.7095 | ||
jak -1.2097 1.5224 | ||
další -0.7288 0.96256 | ||
ale -1.1029 1.4153 | ||
si -1.1097 1.5884 | ||
se -1.2981 1.7707 | ||
ve -1.256 1.7985 | ||
to -1.6894 2.2424 | ||
jako -1.2333 1.5942 | ||
za -1.0376 1.6162 | ||
zpět -0.83657 1.354 | ||
jejich -0.97548 1.4219 | ||
do -0.93685 1.4001 | ||
pro -1.4367 1.9498 | ||
je -1.9446 2.5147 | ||
na -1.5543 2.2901 | ||
atd -0.98175 1.3697 | ||
atp -0.83266 1.1085 | ||
jakmile -1.0954 1.2764 | ||
přičemž -1.0533 1.4279 | ||
já -1.1496 1.4432 | ||
nám -1.0246 1.6043 | ||
jej -1.203 1.6252 | ||
zda -0.93651 1.2363 | ||
proč -0.90395 1.3144 | ||
máte -0.99962 1.4802 | ||
tato -1.3248 1.5575 | ||
kam -0.63468 1.246 | ||
tohoto -0.9737 1.3422 | ||
kdo -0.88982 1.4152 | ||
kteří -0.92973 1.4696 | ||
mi -1.343 1.7217 | ||
tyto -0.99375 1.3067 | ||
tom -1.1636 1.608 | ||
tomuto -1.0103 1.3488 | ||
mít -1.1538 1.6326 | ||
nic -0.76497 1.0685 | ||
proto -1.1781 1.6367 | ||
kterou -1.0561 1.563 | ||
byla -0.9338 1.7033 | ||
toho -1.1263 1.5702 | ||
protože -1.1777 1.4984 | ||
asi -1.0555 1.4401 | ||
budeš -0.98208 1.5432 | ||
s -1.3733 1.6447 | ||
k -1.0223 1.6019 | ||
o -1.4531 1.879 | ||
i -1.0985 1.2956 | ||
u -0.91038 1.6173 | ||
v -1.2536 1.5998 | ||
z -0.96962 1.7437 | ||
dnes -0.92891 1.2478 | ||
cz -0.84461 1.0881 | ||
tímto -0.98475 1.3061 | ||
ho -0.74774 1.4925 | ||
budem -1.0178 1.4333 | ||
byli -0.90776 1.4799 | ||
jseš -1.0297 1.4975 | ||
můj -0.891 1.2674 | ||
svým -1.0586 1.5377 | ||
ta -1.4932 2.0156 | ||
tomto -1.1626 1.5135 | ||
tohle -1.2215 1.6529 | ||
tuto -1.0516 1.3583 | ||
neg -0.94527 1.5529 | ||
pod -1.0601 1.578 | ||
téma -0.93273 1.3456 | ||
mezi -0.96807 1.3465 | ||
přes -1.1927 1.5099 | ||
ty -1.3733 1.7374 | ||
pak -1.0392 1.5592 | ||
vám -0.89801 1.3586 | ||
ani -1.2113 1.5634 | ||
když -1.0124 1.5112 | ||
však -0.75634 1.1299 | ||
či -0.79489 1.2817 | ||
jsem -1.0435 1.4903 | ||
tento -1.0861 1.5053 | ||
článku -0.93302 1.3758 | ||
články -0.98897 1.4387 | ||
aby -1.0874 1.6114 | ||
jsme -1.0547 1.6846 | ||
před -1.0538 1.5186 | ||
pta -1.062 1.6063 | ||
a -1.3116 2.0391 | ||
aj -1.1578 1.5193 | ||
naši -1.2075 1.3714 | ||
napište -1.0436 1.4646 | ||
re -1.3115 1.5453 | ||
což -1.1731 1.3545 | ||
tím -1.0296 1.5885 | ||
takže -1.1014 1.3574 | ||
svých -0.82606 1.1187 | ||
její -1.1029 1.3696 | ||
svými -1.1052 1.4953 | ||
jste -1.1003 1.7465 | ||
byl -0.89449 1.4131 | ||
tu -1.1255 1.5505 | ||
tedy -1.1693 1.6446 | ||
teto -1.2134 1.546 | ||
bylo -0.86091 1.3805 | ||
kde -1.3468 1.7507 | ||
ke -1.0699 1.6688 | ||
pravé -0.9391 1.5172 | ||
nad -1.3404 1.7661 | ||
nejsou -0.85023 1.5033 |
I believe that a load using this method only learns the full-word vectors, as in the `.vec` file. If so, isn't it true that the resulting object doesn't have any capabilities beyond a plain `KeyedVectors`? In that case, using a specialized class like `FastTextKeyedVectors` (which may be trying to do more, such as ngram-tracking, but inherently cannot, because that info was lost in the sequence of steps used to load it) seems potentially misleading. So unless I'm misunderstanding, I think this load technique should use a plain `KeyedVectors`.
Yes, this method is no longer used now that loading relies on the bin file only. I removed the unused code but got a strange flake8 error on Python 3+, so I re-added it for this PR. I'll try removing this unused code later, maybe in a separate PR. @gojomo
That is an odd error! I suspect it's not really the presence/absence of that method that triggered it, but something else either random or hidden in the whitespace.
@gojomo ok, test passed this time after removing this code 😄
For reference, this was a bug in the flake8 script, fixed in cebb3fc
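
To make the point of the review thread concrete: the capability that is lost when only full-word vectors are loaded is building a vector for an unseen word from its character n-grams. The following is a minimal, self-contained sketch of that idea, not gensim's or fastText's actual code. The function names, the toy `ngram_table`, the bucket count, and the n-gram lengths are all assumptions made for illustration, and the hash here works on unsigned bytes, which may differ from the reference implementation for non-ASCII text.

```python
import random


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def fnv1a_32(text):
    """32-bit FNV-1a hash over the UTF-8 bytes of `text` (illustrative choice)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def oov_vector(word, ngram_table, min_n=3, max_n=6):
    """Average the bucket vectors of all n-grams of `word`.

    `ngram_table` is a hypothetical list of equal-length float lists, one per
    hash bucket. A plain full-word lookup table has no such rows, so it cannot
    produce a vector for a word it has never seen.
    """
    grams = char_ngrams(word, min_n, max_n)
    if not grams:
        raise KeyError("no n-grams available for %r" % word)
    num_buckets = len(ngram_table)
    total = [0.0] * len(ngram_table[0])
    for gram in grams:
        row = ngram_table[fnv1a_32(gram) % num_buckets]
        for i, value in enumerate(row):
            total[i] += value
    return [value / len(grams) for value in total]


# Toy data: random 2-dimensional vectors for 1000 hash buckets.
rng = random.Random(0)
NUM_BUCKETS = 1000
DIM = 2
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

vec = oov_vector("článku", ngram_table)
print(vec)
```

With only the `.vec`-style table of full words, a lookup for an unseen word has nothing to fall back on; with the n-gram rows from the bin file, a function like the hypothetical `oov_vector` above can still return something.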