Node crashes can cause data loss #10933
This may not be helpful - and certainly not a technical answer - but personally I've never expected ES to have ACID-like features. It's an indexer, not a db; to that end I never use it as a primary store, simply as an upstream service providing search over data I've ingested from elsewhere.
The thought process we had is that the most common deployments of ES, when it's introduced, are next to a database, a replay-able upstream, or typically a logging/metrics use case, where the default expected behavior of ES is to be semi-geared towards a faster insertion rate by not fsync'ing each operation (though one can configure ES to do so). I find it similar to the (very early) decision to have by default 2 copies of the data (1 replica), compared to 3 (2 replicas). Most common use cases of ES won't expect a 3x increase in storage. Years ago, even 2x was a pain to explain as a default, since most people didn't expect such a system to keep even a single additional copy of the data (and it was the cause of some blows thrown ES's way that it doesn't use Lucene correctly and that's why there is extra storage cost, yay). Also, as you mentioned, other systems don't all default to fsync on each write, or block the result until an fsync happened. (On the other hand, there are known issues that need to be fixed regardless of the default, like #7572.) @dakrone / @bleskes let's make sure we run these tests regardless and see whether anything else was uncovered here, and update this issue.
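(For reference, the replica default mentioned above can be raised per index; a minimal sketch, assuming a hypothetical index named `my_index`, with the caveat that each extra replica costs proportionally more storage.)

```
# Illustrative only: raise the default of 1 replica (2 copies of the data)
# to 2 replicas (3 copies) for a hypothetical index.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```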
Why? :) I personally think it's a valid discussion to have, and one worth reopening every once in a while to verify that the defaults chosen (sometimes at inception :) ) still make sense.
I don't think ES promises "ACID-like" features; on the other hand, they seem not to deliver on what they do promise.
It doesn't feel to me that we made a false promise? We didn't claim to have 0 data loss, and we are very open about what works and what doesn't on our resiliency status page: http://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
The full quote from the ES website is:
I guess that it's up to the reader to infer from the wording that
"minimize the chance of any data loss" doesn't sound like a 100% promise to me |
Everyone can choose what is to be highlighted, the following is semantically equivalent:
A question about this test: Does the data loss occur while the shards are being moved/replicated or a new master is being elected? If you replicate the shards across all active nodes before the test starts, does the data loss still occur?
It's clearly a problem to acknowledge writes which can go on to fail, regardless of whether you expect data loss (something I find comical) or not. Though I suppose the schedules required to guarantee something like that could prove problematic in some general cases. That said, even in a search indexing environment, would you want to lose track of even one document in your index? That sounds like a disaster, at least for the next person who has to find it. To jump on the lexical analysis bandwagon, putting data safety first would mean only acknowledging completed, synced writes; any other situation means that you're putting something other than data safety first, in this case probably write acknowledgement speed. That would be fine if we were frank about it, and even better if we had a strategy to avoid these silent failure cases, at least when we're loading up the database without any particular rush.
Thanks @aphyr. ACID is beside the point. If you lose 10% of important data that you'd like your users to be able to search for, that's a breakdown of the contract implicit in an acknowledged insert. If you're, say, an e-retailer, are you cool with 10% of your inventory (a random set at any given time) being unsearchable?
(I mean, keep in mind that the actual fraction lost is gonna depend on your failure schedule; jepsen tests are intentionally pathological haha)
@sdekock you may be surprised, but there are also circumstances where fsync outperforms directio. In general though, most people align fsync/directio/memmap sync in language even though they are different system calls.
FYI, the link to the translog docs is for version 1.3. Here are the Elasticsearch docs for version 1.5. The one difference seems to be that the transaction log flush threshold has increased.
Btw, I'd like to ask (as a total noob in this area): assuming the same setup as in the original post, is there any mechanism after crash recovery that notifies you about possible data loss? Does every crash (of one node, let's say) mean potential data loss? Is there any standard way to recover from the ES logs alone afterwards, or do I need to correlate app logs with the time of the crash? I know it's many questions - I said I am a noob :)
I get why fsync-on-commit isn't the default, and I am extremely naive about ES, but I don't get why the defaults are 512 MB and 5 seconds and not something smaller, like committing at least once per second.
What performance hit would ES take if it was changed to do it "the right way"? If the performance hit is small/acceptable, make it the default.
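(For anyone wanting to experiment with that trade-off: a minimal sketch of lowering the translog fsync interval, using the `index.translog.sync_interval` setting as it exists in 2.x. The index name is a placeholder, and depending on the version this setting may only be accepted at index-creation time.)

```
# Illustrative only: create a hypothetical index whose translog is fsynced
# every second instead of the 5-second default.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "index.translog.sync_interval": "1s"
  }
}'
```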
Does the test replicate all shards to all data nodes, or only to a subset of nodes in the cluster (for example exactly the subset which will randomly crash)? In my understanding the index operation is synchronous and only returns with an OK if all shards received the document. Therefore, if there is no network partition, the whole cluster has to crash to lose documents this way if all nodes have a replica. Did I miss something?
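(A hedged illustration of the knob that assumption rests on: in 1.x the write consistency level is passed as a query parameter, but note it is a pre-write check that enough shard copies are active, not a post-write confirmation that every replica durably applied the operation. The index, type and id below are placeholders.)

```
# Illustrative 1.x-style request: require all shard copies to be active
# before the write proceeds (the default consistency level is "quorum").
curl -XPUT 'localhost:9200/my_index/my_type/1?consistency=all' -d '{
  "field": "value"
}'
```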
This commit makes create, update and delete operations on an index durable by default. The user has the option to opt out and use async translog flushes on a per-request basis by setting `durable=false` on the REST request. Initial benchmarks running on SSDs have shown that bulk indexing is about 7% - 10% slower compared to async translog flushes. This change is orthogonal to the transaction log sync interval and will only sync the transaction log if the operation has not yet been concurrently synced. I.e. if multiple indexing requests are submitted and one operation's sync call already persists the operations of the others, only one sync call is executed. Relates to elastic#10933
This commit makes create, update and delete operations on an index durable by default. The user has the option to opt out and use async translog flushes on a per-index basis by setting `index.translog.durability=async`. Initial benchmarks running on SSDs have shown that bulk indexing is about 7% - 10% slower compared to async translog flushes. This change is orthogonal to the transaction log sync interval and will only sync the transaction log if the operation has not yet been concurrently synced. I.e. if multiple indexing requests are submitted and one operation's sync call already persists the operations of the others, only one sync call is executed. Relates to elastic#10933
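(For readers landing here later, a minimal sketch of flipping that per-index setting; the index name is a placeholder. As described in the commit message above, `request` fsyncs the translog on every operation, while `async` restores the old interval-based behaviour.)

```
# Illustrative only: opt a hypothetical index out of per-request translog fsyncs.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.translog.durability": "async"
}'
```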
Today we almost intentionally corrupt the translog if we lose a node due to power loss or similar disasters. In the translog reading code we simply read until we hit an EOF exception, ignoring the rest of the translog file once it is hit. There is no information stored about how many records we are expecting or what the last written offset was. This commit restructures the translog to add checkpoints that are written with every sync operation, recording the number of synced operations as well as the last synced offset. These checkpoints are also used to identify the actual transaction log file to open instead of relying on directory traversal. This change adds a significant amount of additional checks and pickiness to the translog code. For instance, the translog is now associated with a specific engine via a UUID that is written to each translog file as part of its header. If an engine opens a translog file it is not associated with, the operation will fail. Closes elastic#10933 Relates to elastic#11011
If Elasticsearch is going to have random data loss which cannot be detected, that poses huge doubt on Elasticsearch as a viable technology for any major business use case. Even if we have a primary store and replicate data into Elasticsearch, not having a way to know that there is data loss would still cause major concerns, as we would never get to know when to reindex the data. Could more details about the tests be shared? Are we talking about an exceptional stress-test scenario, or should we take it that there is not really any guarantee around retaining the most recent 5 seconds of data? Also, this document seems to be saying otherwise: https://www.elastic.co/guide/en/elasticsearch/guide/master/translog.html#img-xlog-pre-refresh

> The purpose of the translog is to ensure that operations are not lost. This begs the question: how safe is the translog? Writes to a file will not survive a reboot until the file has been fsync'ed to disk. By default, the translog is fsync'ed every 5 seconds. Potentially, we could lose 5 seconds worth of data—if the translog were the only mechanism that we had for dealing with failure. Fortunately, the translog is only part of a much bigger system. Remember that an indexing request is considered successful only after it has completed on both the primary shard and all replica shards. Even if the node holding the primary shard were to suffer catastrophic failure, it would be unlikely to affect the nodes holding the replica shards at the same time. While we could force the translog to fsync more frequently (at the cost of indexing performance), it is unlikely to provide more reliability.

If we are willing to take a performance loss, what settings need to be tweaked to fsync each operation?
@yogirackspace Check the linked PR above, which addresses exactly your problem. It's labelled for Elasticsearch 2.0.
@mycrEEpy Sorry, the link is not obvious. Could you repost the link? In case you are referring to the link I gave: as per that document, it is part of 1.4.0 https://www.elastic.co/guide/en/elasticsearch/guide/master/_elasticsearch_version.html
@yogirackspace It's this PR #11011
Thanks @mycrEEpy. Could it also be confirmed that, on a real setup, if the node holding the primary shard were to suffer catastrophic failure and get restarted, and the replicas don't go down, the data would not be lost? Just want to be clear about the data loss occurrence pattern.
@yogirackspace re:
This is correct. An indexing request goes through the following process:
So as long as the replicas remain alive, the change will be persisted on the replica.
Hey @aphyr, I decided not to hate you for this but instead to overhaul a bit how our translog works as well as how we use it. Apparently there are different expectations around the durability aspects of elasticsearch, as well as unclear understanding of what an async commit / fsync means in terms of durability guarantees. Long story short, here are the two problems that cause the data loss:
Both fixes are only in master and targeted for |
I think that what I'd like to see is ES only return OK when things are indeed OK. If things are not OK then allow that to push back to the client. I'm left wondering if part of this issue is related to ES accepting requests when it is not in a position to process them. WDYT?
Let's say you have configured an index to have 2 replicas per shard. The write

However, let's say that the document is indexed on the primary, then the only live replica dies before it can be indexed on the replica. In 1.x, the indexing request is returned as 200 OK (even though potentially the write on the primary could be lost if e.g. the primary is isolated). Any further indexing requests would time out until at least one replica shard is available again.

What has changed in 2.x is that indexing requests now return the number of shards that successfully indexed the document (and how many failed), which at least gives you the information to decide whether you want the write to be retried or not.

Note: the default number of replicas per shard is 1, and the quorum requirement is ignored if only 1 replica is configured. Having just two copies of your data (i.e. the primary and the replica) is sufficient for most use cases, but to make things safer you need at least three (i.e. two replicas).
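(To make the 2.x behaviour concrete, a sketch of the kind of response shape being described; index, type, id and the numbers are illustrative, assuming an index with 2 replicas so that 3 copies are expected in total. Fewer successful copies than the total is the signal that a retry may be worthwhile.)

```
# Illustrative 2.x index request against a hypothetical index.
curl -XPUT 'localhost:9200/my_index/my_type/1' -d '{"field": "value"}'
# Example response shape (values illustrative):
# {"_index":"my_index","_type":"my_type","_id":"1","_version":1,"created":true,
#  "_shards":{"total":3,"successful":2,"failed":0}}
```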
I'm still experiencing lost documents when restarting nodes in 2.0.0-beta1.
On version 1.5.2 I also tried running with

Test code: https://gist.github.com/henrikno/e0ebd6804cb62491343c

Might there be other things than fsyncing that cause this issue?
@henrikno Thank you for bringing this issue to our attention. We approach these matters with the utmost seriousness. We are currently taking steps to reproduce, diagnose and resolve the issue that you report. If you have any additional information that you think will be helpful, please send it our way.
@henrikno another quick question to help us go in the same direction you did. How long did you wait after starting a node and before killing the next one? Was the cluster fully recovered from the previous restart? (i.e., green)
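(For reference, one way to wait for full recovery between restarts is to block on cluster health; a minimal sketch, with the timeout value chosen arbitrarily.)

```
# Illustrative: returns when the cluster reaches green status, or after 60s.
curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'
```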
@jasontedor: I first assumed that with WriteConsistencyLevel.ALL I would need to check

@bleskes: I did test both (wait on green, wait on join) on 1.5.2 and lost documents with both. With 2.0.0-beta1 I can only remember having reproduced it by only waiting for the node to join (i.e. red cluster), and not checking successful != total. I do want to test more but I have limited time this week. If checking successful == total is a recommended solution for achieving "write consistency all" then that's good, but it should be easier/better documented.
@henrikno thanks for the extra info. At the moment we are still trying to understand what you exactly did - so we can try to reproduce. Can you describe the procedure? I'm looking for something like:
You mention your cluster was red - which is interesting. At what point did it get red? (When one node is down, it should be yellow, since you have 1 replica in your index.)
I've argued this elsewhere, the |
@shikhar we are still trying to figure out what exactly happened here. I'm all for discussing WriteConsistencyLevel but let's please do it on the other ticket. This ticket is already complex.
You're gonna hate me for this one. I apologize in advance.
On Elasticsearch 1.5.0 (Jepsen ecab97547123a0c88bb39ddd3ba3db873dccf251),
`lein test :only elasticsearch.core-test/create-crash`
does not affect the network in any way, but instead kills a randomly selected subset of elasticsearch nodes and immediately restarts them, waits for all nodes to recover, then kills a new subset, and so on. As usual, we wait for all nodes to restart and for the cluster status to return green, plus some extra buffer time, before issuing a final read. This failure pattern induces the loss of inserted documents throughout the test: in one particular case, 10% of acknowledged inserts did not appear in the final set.

Is this actually a bug? It's not entirely clear to me what kind of crash schedules Elasticsearch should actually tolerate, and the docs seem pretty thin. https://www.elastic.co/products/elasticsearch says
But http://www.elastic.co/guide/en/elasticsearch/reference/1.3/index-modules-translog.html suggests that ES only fsyncs the transaction log every five seconds, so maybe there aren't supposed to be any guarantees around retaining the most recent 5 seconds of data? In that case, why advertise transaction logs as a persistence feature?
Maybe this is super naive of me, but I kinda envisioned "putting data safety first" meaning, well, flushing the transaction log before acknowledging a write to the client. That's the Postgres default, and MySQL's default appears to be write and flush the transaction log for every commit as well. Zookeeper blocks writes for fsync, which makes sense, but Riak's leveldb backend and bitcask backend default to not syncing at all, and Cassandra fsyncs on a schedule rather than blocking writes as well.
Thoughts?