Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telnet Broken #3108

Closed
TerryE opened this issue May 14, 2020 · 14 comments
Closed

Telnet Broken #3108

TerryE opened this issue May 14, 2020 · 14 comments

Comments

@TerryE
Copy link
Collaborator

TerryE commented May 14, 2020

Expected behaviour

(Copied from @pnkowl: #3080 (comment)) The telnet module should work as expected on dev and master

Actual behaviour

LFS Dev branch is not dead like master, but not as expected either. I tried 3 versions of telnet and none worked. All returned a never ending stream of the following to the putty termial:

Pipe: 0x3fff2f38Pipe: 0x3fff2f38Pipe: 0x3fff2f38Pipe: 0x3fff2f38Pipe: 0x3fff2f38Pipe:

Note that Firmware Dev Branch with "simple" telnet (run without LFS) seems ok. BTW, FTP seems ok
.

Test code

ESPlorer terminal looked like this. Note that the E;M "error" occurred many seconds later, after the putty sesson was "closed")

> print(wifi.sta.getip()) 
192.168.1.17	255.255.255.0	192.168.1.1
> node.flashindex("ftpserver")().createServer("test","12345") 
> node.flashindex("_init")()  
> =LFS
table: 0x3fff01f8
> node.flashindex("telnet")().open() 
> Telnet server started (38528 mem free, 192.168.1.17)
telnet_server now open: 	userdata: 0x3fff23a8
E:M 24592
attempt to call a table value
stack traceback:
	[C]: ?
	[C]: ?

Most simple local telnet version

cannot open init.lua:
> print(uart.setup(0, 115200, 8, 0, 1, 1 ))
115200
wifi.setmode(wifi.STATION); wifi.sta.config(station_cfg[9]) wifi.sta.connect()
> print(wifi.sta.getip())
192.168.1.17    255.255.255.0   192.168.1.1
> dofile("telnet_bare23.lua")
> E:M 304
E:M 4840

telnet.zip

NodeMCU startup banner

NodeMCU 3.0.0.0 built on nodemcu-build.com provided by frightanic.com
        branch: dev
        commit: 5e2ea5a226ae30b066968efd27e4e2510e6cd6f4
        release:
        release DTS: 202005120049
        SSL: false
        build type: float
        LFS: 0x40000 bytes total capacity
        modules: adc,bit,cron,dht,file,gpio,mqtt,net,node,ow,pwm,rtcmem,rtctime,sntp,softuart,struct,tmr,uart,wifi
 build 2020-05-12 14:43 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)

Hardware

Describe which ESP8266/ESP32 device you use and document any special hardware setup
required to reproduce the problem.

@pnkowl
Copy link

pnkowl commented May 15, 2020

Terry, Thanks for splitting this out.

Several things

(1) Given that these symptoms have been separated from #3080, should I learn from this that only if I, the user, can definitively state that differing symptoms have the same root cause, should I keep them together? In this instance, it seemed plausible to me that they were related (e.g. how many undesirable things can happen at the same time), but technically I have no idea. As such, the symptoms should be separate, and optionally referenced as possibly related. Gotcha.

(2) The bits missing from above

Hardware

Describe which ESP8266/ESP32 device you use and document any special hardware setup
required to reproduce the problem.

NodeMCU Ver 0.9 (4M byte, ESP-12)

  • no connections except USB

Other

  • esptool to flash firmware
  • esplorer to upload LFS and other files, as needed

Please Note

@pnkowl
Copy link

pnkowl commented May 16, 2020

It appears that dev branch simple telnet without LFS (memory reserved, nothing flashed) also fails.

Same firmware as above. No LFS flashed.

Symptoms:

  • Sometimes the serial coonsole shows a panic.
  • Sometimes the serial console just sits there.
  • In no instance does the putty terminal show any activity.

Serial Console (Dev Branch)

> station_cfg={}; station_cfg[9]={ssid="ssid",pwd="pass",save=false}; wifi.setmode(wifi.STATION); wifi.sta.config(station_cfg[9]); wifi.sta.connect()
> print(wifi.sta.getip())
192.168.1.18	255.255.255.0	192.168.1.1
> dofile("telnet_bare23.lua")
> =node.heap()
38608
> E:M 304
E:M 304

Code

-- a simple telnet server
s=net.createServer(net.TCP,180)
s:listen(23,function(c)

    function s_output(str)
      if(c~=nil)
        then c:send(str)
        end
      end

    -- re-direct output to function s_output.
    node.output(s_output, 0)

    --like pcall(loadstring(l)), support multiple separate lines
    c:on("receive",function(c,l)
      node.input(l)
      end)

    --unregist redirect output function, output goes to serial
    c:on("disconnection",function(c)
      node.output(nil)
      end)

    print("Welcome to NodeMCU world.")
  end)

@TerryE
Copy link
Collaborator Author

TerryE commented May 16, 2020

@pnkowl, what's wrong with app/lua_examples/telnet_pipe.lua? We've provided examples that work. If you don't want to use them, and use your own version, then work out why they don't work and tell us. If there is a bug then I'll fix it. I spent a couple of hours last night explaining why your Lua pattern matching was so slow. I can't do this every night, not unless you want to pay me an hourly rate 😉

Read the node.output() documentation; the function takes a pipe as a parameter, not a string.

@pnkowl
Copy link

pnkowl commented May 16, 2020

@TerryE, I am trying to help, but I feel you are not feeling it, for what I believe from your perspective are understandable reasons.

Summary

  1. After reading your feedback and looking at the docs, it appears to me that Dev branch breaks Master branch existing user application code and nowhere in the Dev branch node.output() section is this surfaced, not in a caution, not in a note, not in the fine print and not in our dialog so far.
  2. Sometimes I feel encouragement and sometimes not. Figuring out how to best encourage and deploy the skills of staff can make or break a project, especially when volunteers are involved. We could all be doing other stuff, I get that. Although limited, given my posts that you have seen as a guide to my talents, how would you suggest I use my talents within the nodemcu community? Up untl quite recently, I have been quietly working away with what is provided.

Details

How I got to the summary. Please note that the realization that Dev and Master are incompatible did not occur until very close to the bottom.

I did not split this issue out. You did. I thought this might be relevant for the issue with LFS web service, so I added it as a comment there. I did not believe it should stand alone. I believed at the time, that the versions of telnet that I use worked with the dev firmware without LFS. I have explained this elsehere.

If I recall, as part of this ticket, telnet fails for all 3 of the telnet versions I have acquired (kindly developed by others and shared)

Here is another example from the master branch docs found at
https://nodemcu.readthedocs.io/en/master/modules/node/#nodeoutput
which behaves as follows

the response on the serial console
> E:M 4840
Putty window has no reaction.

and please note, there is no other reference on that page to other versions.

-- a simple telnet server
s=net.createServer(net.TCP)
s:listen(2323,function(c)
   con_std = c
   function s_output(str)
      if(con_std~=nil)
         then con_std:send(str)
      end
   end
   node.output(s_output, 0)   -- re-direct output to function s_ouput.
   c:on("receive",function(c,l)
      node.input(l)           -- works like pcall(loadstring(l)) but support multiple separate line
   end)
   c:on("disconnection",function(c)
      con_std = nil
      node.output(nil)        -- un-regist the redirect output function, output goes to serial
   end)
end)

Okay, so let's try the same code using master branch firmware
it works as expected.

So this certainly seems to be a "better" MCVE than say the version that uses fifo, and fifosock and LFS. Have I missed something regarding MCVE?

Perhaps the root problem is the need for "Minimal, Complete, and Verifiable example". This asks for the issuer to attempt to reduce the scope to its smallest extent. I understand this to mean, the smallest amount of code. The example that is most closely aligned with the official documention. To the extent that I can come up with inline code, I should.

I agree with this in principle, I agree with its sentiments in practice.

So I have presented an example, that works with the master branch and fails to perform as expected with the dev branch, and where the example is exactly the code as provided in the docs (albeit master branch).

Read the node.output() documentation; the function takes a pipe as a parameter, not a string.

Not to jump off the cliff, but this quote does not say that dev and master are not compatible with respect to "telnet" functionality. But after inspecting both sets of docs, this is exactly what appears to be happening. Dev breaks Master branch existing user application code and nowhere in the Dev branch node.output() section is this surfaced, not in a caution, not in a note, not in the fine print and not in our dialog so far.

There is at least one lesson here somewhere...

@TerryE
Copy link
Collaborator Author

TerryE commented May 17, 2020

@pnkowl, you are right about me being (unfairly) a little irritated. Sorry. We are in the middle of a major piece of roadmap work, at the moment. We've added the option of a Lua 5.3 runtime engine; this in itself involved a lot of work. Getting the NodeMCU framework and modules to work under both variant APIs a lot more, and we know there are holes in the documentation.

The intent was for me to do the bulk of the complex runtime changes; get a working software base available for everyone to use, and to have a shared effort at tidying up the documentation, but ...

The simple telnet server isn't compatible with node.output() as implemented and documented. Any references to it need removed. The version I referenced works fine with my current working branch.

Also the current issue templates probably aren't a good fit IMO for discussing documentation issues.

@pnkowl
Copy link

pnkowl commented May 17, 2020

@TerryE, thank you for your latest reply.

If I can follow up on the last point you shared

Also the current issue templates probably aren't a good fit IMO for discussing documentation issues.

Okay, then what is?

In my previous life, documentation (the written, non coded word) was the foundation for everything. It described the customer requirements. It was the design spec. It bounded unit test. It was the validation and verification plans. It was the end user documentation. It was the gantt or pert chart showing project dependencies, resource flexibility and allocation, critial paths and where risks needed to be attended to.

If the documentation says the system has x behavoir, then if the customer observes x behavior, the systems is as designed -- even if it turns out that the behavior is less desireable -- at least the behavior is out in the open. From a strict perspective (IMO), bugs are when the mismatch between stated and actual function occurs. Upgrades and/or enhancements are when improvements above and beyond the current spec. are attained.

Today, very little of this formality exists in much that is commercially produced. Customers just have to figure it out by trial and error.

I find it very encouraging that Nodemcu seems not that way. If it were, there would be no way to even have this conversation. So for that, thank you.

If the team currently struggles with this kind of thing, then perhaps that is where my talents would be most beneficial.

@pnkowl
Copy link

pnkowl commented May 17, 2020

@TerryE, one more thing in an effort to show support for you. If, when you are done reading, and you are not feeling it, let me know.

I spent a couple of hours last night explaining why your Lua pattern matching was so slow. I can't do this every night, not unless you want to pay me an hourly rate

I was shocked when I read this, not by the tone (which I can understand), but the underlying truths not really said

  • I can only imagine the sacrifices you make. How can the team make helping more accessible to the folks at the bottom looking up (count me in that group)?
  • How does an issuer tell you how important the issue is to them, or what they are most interested in regarding the details shared? The template does not seem to cover this area of communication.
  • Had I proactively included this information, it would have gone something like this. I have a work around for this (not the telnet issue but the slow LFS pattern matching LFS flashing alters core firmware behavior resulting in panic  #3109), so this is low priority for me. That being said, I have no idea how this may affect others, or how it affects the upcoming release (PR of dev to master). Please use your judgement on how to prioritize even the evaluation of this. From my experience, I know that further down the D&D pipe an issue progresses, the more expensive it is to fix. If this is not a bug, but a known artifact of the design, I would be happy to try to write this up to be included as an example in the LFS documentation. If this is a problem of my own doing, let me know and we are good.
  • As a general rule, everything in each of my implementations is documented and backed up (knock on wood). Each deployed device has a plug compatible backup device. All my formal releases include local code, LFS (both zip and img), firmware and design documentation and all this is archived together. Every release has its own test plan. Everything is version controlled. As such, I hope "never" to be dependent on the nodemcu team for resolution of a critical problem I have.
  • I would prefer you ask before spending hours on me, especially when it does NOT add value to the aggregate project. Your time is too valuable to the team. I was not looking for help to solve the slowness, but simply to make the team aware in case it described a design issue which was previously unknown.
  • I am both proud and happy to be able to "support" those who have a method of receiving donations (e.g. https://nodemcu-build.com/index.php). I know it is not anywhere close to "fair" compensation, but it is a "tangible" way to say thank you. If however, we were to embark on a "time and materials" contract, then writing long notes, or reading them, given all the wonderful communication technology, is likely not the best use of time. That being said, if what transpires is a more rigorous and thoughtful process that yields tangible fruit, then perhaps it is.

@pjsg
Copy link
Member

pjsg commented May 17, 2020

The sad fact about nodemcu is that the majority of the code comes from a small (very small) group of people. In my case, a few years ago, when I wanted to build some apps with nodemcu, I discovered issues with the code. I realized that these issues were not important to other people, so the only solution was to fix them myself. I also needed device drivers, so I wrote those too. I also contribute (mostly small stuff) to a number of other open source projects. There is a huge difference between these projects as to how receptive they are to pull requests. There is one project where PRs are merged within hours, and others where it can take months. Is this frustrating -- yes.

My approach has been to realize that the maintainers are people just like us with views on things 'ought to be'. Then I engage with the maintainers to figure out how they want to interact with contributors, and modify my style appropriately.

@TerryE
Copy link
Collaborator Author

TerryE commented May 17, 2020

A quick answer to @pnkowl. Long one tomorrow. The reality is that we tend to triage new to project posters. Anyone who looks like they are taking the time and effort to engage gets an immediate karma bump whatever the content of the contribution. So I am happy to give my time if the other party is going to do likewise.

@jmd13391
Copy link

I notice many parallelisms between NodeMCU and an old programming language favorite of mine, REXX. It might be worth considering the adoption of the "documentation before implementation" approach used by the REXX Author:

The Design of the REXX Language (page 334)

Documentation before implementation. Each major section of the REXX language was documented and circulated for review before its implementation. These sections were in the form of complete reference documentation that in due course became part of the language reference manual. At the same time, and before implementation, sample programs were written to explore the usability of each proposed new feature.
The benefits of this approach were marked:
• The majority of usability problems were discovered before they became embedded in the language or before any implementation of the language included them.
• The writing of documentation was found to be the most effective way of spotting inconsistencies, ambiguities, or incompleteness in a design.
• The designer did not consider implementation details until the documentation was complete, so as to minimize the implementation's influence upon the language.
• Reference documentation written after implementation is much more likely to be inaccurate or incomplete than that written before implementation. After the documentation has been written, the author is likely to know the implementation too well to write an objective description.

Read through that entire document when you have a few minutes. IMO, there are several morsels of insight contained within that might help make the NodeMCU journey a little easier for all of us. ;-)

@pnkowl
Copy link

pnkowl commented May 18, 2020

@jmd13391, Thank you for this. I happen to identify stongly with where I believe you are coming from, which brings back memories for me. Those were the days when IBM was on the last part of its ascendency and where resources were plentiful.

IMO, there are several morsels of insight contained within that might help make the NodeMCU journey a little easier for all of us. ;-)

I have read the page your referenced, have returned to the beginning, and was about to put pencil to paper, and I realized that I was about to collect "my" morsels. Would it be possible for you to share the several morsles of insight you thought may help? (apart from what you posted here). I continue to have to remind myself to listen more, and then confirm that I understand.

@TerryE
Copy link
Collaborator Author

TerryE commented May 18, 2020

The issue with the old strategy for telnet was that Lua code (and C runtime libraries) could call print() which was calling the node_output() redirector which was then calling the Lua redirector calback using a `lua_call()'

  • This was really fragile because it was really breaking some implicit assumptions in the Lua VM.
  • The call to the redirector was on a per-field basis so print(a,b,c) would generate 6 calls (including 3 for the separators) which would also generate 6 telnet IP packets.
  • If you tried to do any serious printing then this would usually end up overrunning the net stack and restart the ESP.
  • Interactive errors weren't redirected to the telnet session
  • CB errors were not caught and panic the ESP casing the telnet session to hang or drop.

In short these example telnet sources were OK for a carefully created "in principle" demo but were unusable in practice. Various issues including #2033, #2541, #2797, #2802, #2884 have discussed this in the past.

Using a pipe for STDOUT has a lot of advantages. Marshalling and packet splitting is handled automatically. With the new error handling (see #3078) errors get sent to the telnet session and the user can elect to suppress restart on panic. The lua_examples module that we supply works.

  • Perhaps our was that this is in lua_examples/telnet and not lua_modules. It is documented as per our conventions in the folder's README.md. Maybe we should move it to lua_modules, in which case our convention is to move the documentation the docs folder and link it into the online documentation. I also missed that @nwf had renamed the default version from telnet.lua to telnet_pipe.lua. The former is better IMO.

What really confuses me here is that all of our references to "telnet" in our current documentation refer to this lua_examples/telnet version, so I am not sure what is wrong here. AFAIK, all references to obsolete version have been removed.

@pnkowl
Copy link

pnkowl commented May 18, 2020

@TerryE, Very nice concise summary (as far as I can tell, not having tried the new pipe version yet).

I think this gets to the nub of the disconnect (my emphasis)

all of our references to "telnet" in our current documentation refer to this lua_examples/telnet version

  • I see "master" as the current standard
  • I believe @TerryE sees "dev" as the current standard

I can understand this! Dev is where developers spend the bulk of their time. To be effective, they must naturally think in the context of that environment. For developers, "dev" is a fact. For me "dev" is a hope.

Here is an example, of the same symptom. Please understand it is an example worthy of MCVE (e.g. trivial in scope). The backstory may be relevant in developing an understanding of how disconnects develop, so here goes. 20 days ago, I opened my first issue #3080. This issue pretty much stopped my development tasks (which I do using the master branch). I found little things I could do while waiting for a fix, but 10 days ago, I read issue #3095. Issue #3095 touched on a topic in my less fun list of formally documenting the oddities I have seen during my Nodemcu journey. So I did the analysis and opened #3096. I see dev as an extension of master, and unless specified, behaves the same or better. So some of the observations (reported as issues) contain a dev banner. Out come #3108, #3109, #3110. I have others I am working on. I see this work as reporting observations (tbd facts), possibly relevant to others. The following example comes from an issue under the "master" banner, but to me, I am just documenting observations.

Enough backstory, back to the point at hand, can you see the minor disconnect in the following dialog?

posted by @TerryE in #3096 (comment)

What does 
for k,v in pairs(node.getpartitiontable()) do print("%s=0x%x" % {k,v}) end
printout?

posted by @pnkowl in #3096 (comment)

> for k,v in pairs(node.getpartitiontable()) do print("%s=0x%x" % {k,v}) end
stdin:1: attempt to perform arithmetic on a string value
stack traceback:
	stdin:1: in main chunk

> for k,v in pairs(node.getpartitiontable()) do print(k,v) end
lfs_size	262144
spiffs_addr	770048
lfs_addr	507904
spiffs_size	3424256`

posted by @TerryE in #3096 (comment)

**PS**: sorry about using the string % operator: this was introduced in a recent `dev` PR and has not made it into `master` yet.

All good, right? I thought so at the time, but it shows how miscommuncation occurs when our frames of reference are not the same.

I think if master and dev are considered 2 different applications, all is okay, but if they are related, then that which differentiates them should be clearly stated, both the pluses and minuses. In this instance I offer the following.

Dev (soon to be master) contains telnet functionality that is

  • much improved in the following ways... [tl;dr]
  • but it is not compatible with the "current" master nodemcu version <1>
  • caution: this means that current applications using telnet will have to be modified, or lacking that, the module will panic when a client initiates a telnet connection or it might issue a E:M error to the serial console or it may just ignore you entirely, but at no time will the client side show any signs of life.

<1> I am not even comfortable naming "dev" and "master" in this proposed statement. As users, we paste the "NodeMCU startup banner" to communicate what we are using, which as far as I can tell, differentiates them commonly by "dev" and "master". But dev becomes master, so these names are temporal. As such, should we be using other information to ensure clear temporal naming during our discussions during the transition?

Please also note that the 3 bulletted items above are not to pass judgement, but to simply, and as clearly as possible, state the facts I believe are most beneficial to users not intimately involved in the change, and to those that may not be terribly interested in any more detail regarding the change, than what needs to be done to their applications to keep them from breaking.

In terms of getting the word out, I would include this information at the following project "checkpoints"

  • the announcement for users to verify/validate dev in the call for PR adoption to master (or however it is done, having no experience myself). Classically this is part of customer acceptance.
  • the release notes to accompany the new "master"
  • any documentation where the word "telnet" appears
  • any web cloud build service (e.g. firmware, LFS) to warn folks that things have changed (could break currently working applications) and a link to the proper resource. Users must understand that firmware/LFS pairing of master/master or dev/dev is temporal. Since at point in time X, dev becomes/became master, so in effect, matching your old master firmware with a new master LFS will not work (may not work?).

I would also recommend that the current master documentation be made easily accessible for folks that continue to use that as their application development platform while the dust settles on the new.

@TerryE
Copy link
Collaborator Author

TerryE commented May 18, 2020

Somehow the updates to docs/modules/node.md and to the telnet example missed the last cut to master. I don't why. Maybe @marcelstoer can explain, but I guess that one of the committers made a simple whoopsie. We only update master post cut in the case of severe breaking faults where there is no workaround. In this case the workaround is to use the dev telnet_pipe.lua version. So we won't be updating this master

The next cut should fix this. Marcel and I will take an action to check that the documentation is in sync on the next cut.

I really don't know why it has to take 10,000 words of dialogue to go over the same ground what seems to be about 5 times. Marcel and I will take an action to check that the documentation is in sync on the next cut.

@TerryE TerryE closed this as completed May 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants