
Instrumentation overhaul, fixes #91 #92

Merged: 33 commits into TimelyDataflow:master from new_logging, Jan 8, 2018

Conversation

@utaal (Member) commented Aug 28, 2017:

No description provided.

@utaal (Member, Author) commented Dec 15, 2017:

@antiguru is likely to use this code for scaling-related mechanisms, so we went back and tried to clean it up; it seems to make sense and work, with a small amount of overhead (which I still have to re-measure). For the scaling, we're thinking of replaying some of the logs in a dataflow within the same computation (as you [@frankmcsherry] proposed some time ago).

// Stage the serialized event in a shared buffer,
let mut buffer = self.buffer.borrow_mut();
unsafe { ::abomonation::encode(&event, &mut *buffer).unwrap(); }
// then copy it out to the stream and reset the buffer.
self.stream.borrow_mut().write_all(&buffer[..]).unwrap();
buffer.clear();
@frankmcsherry (Member) commented on the diff:

Can you explain what's going on here? It looks like you have a second buffer into which you stage the serialized data, and then you copy it all out again. Is something else going on with the buffer that I'm missing? (Also, why the RefCell around the writer?) Perhaps this makes sense given constraints somewhere else!
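
For reference, ::abomonation::encode writes into any std::io::Write, so in principle the event could be encoded straight into the stream, skipping the staging buffer entirely; whether the PR's constraints permit that is exactly the question above. A minimal sketch (illustrative, not the PR's code):

```rust
use std::io::Write;
use abomonation::Abomonation;

// Illustrative only: serialize the event directly into the writer,
// avoiding the intermediate staging Vec and the extra copy.
fn write_event<T: Abomonation, W: Write>(event: &T, stream: &mut W) {
    unsafe { ::abomonation::encode(event, stream).unwrap(); }
}
```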

})
};
(time::precise_time_ns() as i64 - delta) as u64
}
@frankmcsherry (Member) commented Dec 17, 2017:

Might this be a good time to sort out what notion of time we want to use in these logging channels? Way back when, I floated an alternative that we use a "timely-local" time, roughly "nanoseconds since initial synchronization" rather than "local system clock", which has some good things and some bad things.

  1. Good: it meant that clock skew and such were less problematic (if two computers' clocks are off by a minute, the existing approach will stall a bunch of logs, right?).

  2. Bad: it is harder to correlate events in the logging stream with other events that occur in the system, outside of timely (e.g. spikes measured through other tools).

I think the bad thing could in principle be fixed by capturing the machine local time at which the synchronization happens and disseminating this to everyone (playing the role of the PRECISE_TIME_NS_DELTA static). It would mean we need to write down the captured local time, to support correlation with other events that only have local time, but that seems doable.

Also, we could then ditch the time crate, which I think is only included in timely for this.
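
A minimal sketch of the "timely-local time" idea using std::time::Instant from the standard library (monotonic, and it would also replace the time crate); the names here are illustrative, not a proposed API:

```rust
use std::time::{Instant, SystemTime, UNIX_EPOCH};

// Illustrative sketch: a clock anchored at initial synchronization.
struct LogClock {
    start: Instant,     // monotonic anchor; immune to wall-clock skew
    start_wall_ns: u64, // wall-clock time at synchronization, written down
                        // once so events can still be correlated with tools
                        // that only see the local system clock
}

impl LogClock {
    fn new() -> Self {
        let wall = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
        LogClock {
            start: Instant::now(),
            start_wall_ns: wall.as_nanos() as u64,
        }
    }

    // "Nanoseconds since initial synchronization."
    fn now_ns(&self) -> u64 {
        self.start.elapsed().as_nanos() as u64
    }
}
```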

/* 8 */ GuardedProgress(GuardedProgressEvent),
/* 9 */ CommChannels(CommChannelsEvent),
/* 10 */ Input(InputEvent),
}
@frankmcsherry (Member) commented on the diff:

We've talked a bit about this before, and I'm still not in love with it yet. :) If we treat all of the logging stuff as unstable, then no worries because we can try things out, but I think:

  1. We may want separate streams, so that individual streams can be selectively (and programmatically) enabled for capture. E.g. right now, I could imagine wanting to turn off the Schedule events because they are noisy af.

  2. We may also want to support user-defined logging, at least playing out the fantasies I had with ETW back at MSR. :) If a user has higher-level begin-end events that they want to track, we probably wouldn't want to require a monolithic enum that they extend (perhaps we don't, and this enum gets locked down; no worries then).

Anyhow, just writing these down rather than putting them up as a roadblock!
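
On point 2, a sketch of what user-defined logging could look like if the logger were generic over the event type instead of a monolithic enum; all names here are illustrative, not the PR's API:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Illustrative: one logger per event type; users bring their own types,
// and each type naturally forms its own (selectively enabled) stream.
pub struct TypedLogger<E> {
    events: Rc<RefCell<Vec<(u64, E)>>>, // (timestamp, event)
}

impl<E> TypedLogger<E> {
    pub fn log(&self, ts: u64, event: E) {
        self.events.borrow_mut().push((ts, event));
    }
}

// A user's higher-level begin/end events, never registered centrally.
pub enum SpanEvent {
    Begin(&'static str),
    End(&'static str),
}
```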

}

pub fn log(&self, l: L) {
match self.internal {
@frankmcsherry (Member) commented on the diff:

Is there a particular reason to have the RefCell inside the BufferingLogger and to use &self, versus making the logger type Rc<RefCell<BufferingLogger>> and having this take &mut self? Not judging, just trying to grok whether we are intentionally hiding the potential sharing and creating a risk (e.g. a logger that logs could panic, right?), or whether it would work just as well with &mut and the RefCell lifted up a level.
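
To make the two shapes concrete, a minimal sketch (illustrative types, not the PR's code):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Shape A (as in the PR): interior mutability. Handles log via &self,
// but a logger invoked re-entrantly hits a RefCell borrow panic at runtime.
struct LoggerA { buf: RefCell<Vec<String>> }
impl LoggerA {
    fn log(&self, l: String) { self.buf.borrow_mut().push(l); }
}

// Shape B: the RefCell lifted up a level. log takes &mut self, and the
// sharing (with its runtime-borrow risk) is visible in the handle type.
struct LoggerB { buf: Vec<String> }
impl LoggerB {
    fn log(&mut self, l: String) { self.buf.push(l); }
}
type SharedLoggerB = Rc<RefCell<LoggerB>>;
```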

buf.push((ts, setup.clone(), l));
if buf.len() >= BUFFERING_LOGGER_CAPACITY {
(*pushers.borrow_mut())(LoggerBatch::Logs(&buf));
buf.clear();
@frankmcsherry (Member) commented Dec 17, 2017:

What do you think about the Push and Pull idioms from the communication crate? Maybe this would just be Push, actually. Rather than a function that takes a &Vec<_> which is then cleared afterwards, you'd use a function that takes an Option<&mut Vec<_>>, which allows the recipient to swap out the backing memory if they like (e.g. if they just want to stash the Vec in a linked list or something), or just drain it, as appropriate.
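
A sketch of that Push-style interface (illustrative names, not the communication crate's actual traits): the recipient can either drain the Vec or swap the backing memory out, and None can signal a flush/close:

```rust
use std::mem;

// Illustrative: a pusher that receives whole batches of log records.
trait LogPusher<T> {
    // Some(buf): a batch the recipient may drain or swap; None: end of stream.
    fn push(&mut self, batch: Option<&mut Vec<T>>);
}

// A recipient that stashes the backing Vecs instead of copying them.
struct Stash<T> { batches: Vec<Vec<T>> }

impl<T> LogPusher<T> for Stash<T> {
    fn push(&mut self, batch: Option<&mut Vec<T>>) {
        if let Some(buf) = batch {
            // Take ownership of the memory; the caller gets a fresh Vec.
            self.batches.push(mem::replace(buf, Vec::new()));
        }
    }
}
```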

@frankmcsherry (Member) commented:

I'm trying to get my head around the code at the moment, sorry for the delay. It may be better to get an in-person run-down from you folks. I'm reading, but it's not entirely clear to me yet how all the parts fit together (e.g. what the environment variables are for, whether they can/should be abstracted out as a configuration, who actually uses which types, etc.).

We can either chat about these things on gitter, or just chill for the year. Not sure who is around ETHZ and active; I wouldn't want any of you to have to work when you should be drinking Glühwein.

@frankmcsherry (Member) commented:

I mentioned this to @utaal on gitter, but one high-level thought I have about the design is:

It seems like there is a bunch of functionality parallel to what timely dataflow itself does. That is, all the log data gets captured, buffered up, and pushed at various event writers that may or may not fire it at a TCP stream (looking at FilteredLogManager).

Would it not be simpler, and perhaps more tasteful, if each of these logs just went into an EventLink type of structure that does approximately zero work, and from which one can replay the corresponding stream in a timely dataflow if one wants to process it, filter it, capture it to a network socket, that sort of thing? This approach has a lot of appeal to me: we get to re-use the timely infrastructure where appropriate, including the communication threads and whatever robust capture implementations get written (e.g. to file, TCP socket, Kafka), without having to hand-roll new logging code.

I think the scary issue here was originally "how to prevent logging from logging itself?", but I think we could finesse this pretty easily if we wanted (e.g. a dataflow() variant that swaps in a bogus LoggerConfig).

Do you all (@utaal, @antiguru) have thoughts on whether and why we might prefer hand-rolling these parts? I could imagine some good reasons, and you may have told me before and I've forgotten, but can we work through it again, then? :)
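
For concreteness, a sketch of the EventLink approach using timely's existing capture/replay machinery (exact paths and names vary by version; treat this as illustrative): log events are pushed into a shared EventLink, and any interested party replays them as a stream in a dataflow.

```rust
use std::rc::Rc;
use timely::dataflow::operators::capture::{EventLink, Replay};
use timely::dataflow::operators::Inspect;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Shared, approximately-zero-work buffer between logger and consumer.
        let link = Rc::new(EventLink::<u64, String>::new());
        let reader = link.clone();
        worker.dataflow::<u64, _, _>(|scope| {
            Some(reader)
                .replay_into(scope)
                .inspect(|x| println!("log event: {:?}", x));
        });
        // The logging harness would push events into `link` elsewhere.
    }).unwrap();
}
```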

@antiguru (Member) commented:

I like the idea of reusing as much as possible from the timely communication library in the logging infrastructure - the goal here is not to come up with a second implementation of something already provided.

On your second comment: my goal is to use timely dataflow to observe a computation, feed that data into a policy, and use the output of the policy to decide how to change the computation. This means it would be desirable to have the logging stream as another Stream object, so we could actually try to run both the policy engine and the computation within the same timely instance.

As an aside, it seems the current PR puts sending logged events on the critical path (EventWriter::push), which we probably don't want.

@frankmcsherry (Member) commented Dec 18, 2017:

Re: the critical path, this seems legit, and is also currently true of capture going to a TCP stream. It probably wouldn't be too hard to spin up a threadpool for "background" work that doesn't support non-blocking or low-priority operation. For example, the Kafka adapter uses a library (someone else's) that spins up its own thread. Worth looking into to minimize disruption, but if we end up with 100 spare threads per worker we'll want to rethink a bit (e.g. green threads, or non-blocking APIs).

Edit: thinking a bit more, the EventLink option should be pretty close to zero overhead on the critical path (at least, a circular-buffer version). Once you say "hey, I want a dataflow with that and some TCP capture", one could say you've opted out of the "not on the critical path" aspect. That's probably good news, i.e. we could effectively remove logging from the critical path, and then let you re-introduce it if you want (or spin up a background thread that munches each of the EventLink things).
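
A sketch of the background-thread option mentioned above (illustrative, not the PR's API): the worker hands serialized batches to a channel, and a dedicated thread does the blocking writes off the critical path.

```rust
use std::io::Write;
use std::sync::mpsc;
use std::thread;

// Spawn a writer thread; returns a handle the worker can send batches to.
// Dropping the sender ends the loop and lets the thread exit.
fn spawn_log_writer<W: Write + Send + 'static>(mut writer: W) -> mpsc::Sender<Vec<u8>> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    thread::spawn(move || {
        for batch in rx {
            // Blocking I/O happens here, not on the worker thread.
            let _ = writer.write_all(&batch);
        }
    });
    tx
}
```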

// None when the logging stream is closed
frontier: Option<Product<RootTimestamp, u64>>,
event_pusher: P,
_s: ::std::marker::PhantomData<S>,
@frankmcsherry (Member) commented on the diff:

minor: ::std::marker::PhantomData<(S, E)>

@utaal force-pushed the new_logging branch 2 times, most recently from 845940a to 1e746d4, on December 20, 2017 16:54
anonion0 and others added 14 commits January 4, 2018 14:12
utaal and others added 13 commits January 4, 2018 14:12
@utaal utaal mentioned this pull request Jan 5, 2018
@utaal changed the title from "WIP: Instrumentation overhaul, fixes #91" to "Instrumentation overhaul, fixes #91" on Jan 5, 2018
@utaal (Member, Author) commented Jan 5, 2018:

Latest performance numbers for pingpong:

master:

$ time hwloc-bind socket:1 -- ./target/release/examples/pingpong 5000000 -w 1

real    0m11.848s
user    0m11.768s
sys     0m0.064s

this branch, logging disabled:

$ time hwloc-bind socket:1 -- ./target/release/examples/pingpong 5000000 -w 1

real    0m12.809s
user    0m12.736s
sys     0m0.060s

+8.4%

@utaal (Member, Author) commented Jan 5, 2018:

From my side, this is ready to review/merge.

@frankmcsherry (Member) commented:

Cool, thank you! I think we should land this and then keep working, but I had two quick questions first:

  1. I think capture/event.rs still has the copy regression in it, which I think came from some copy/paste of old code. I can fix it up subsequently, but if you have a chance to bang that out, sweet.

  2. Can you remember where the 8% disabled overhead is coming from? My recollection was that it was in operator::pull_progress with the logic outside the logging conditional, but I thought that got cleaned up. Is there another overhead, or is this just "all those mostly predicted branches"?

@utaal (Member, Author) commented Jan 8, 2018:

  1. Fixed.
  2. The 8% overhead is about half from the additional sequence numbers (on progress messages and communication) and roughly half from something else (it looked like it may be the additional branch in pull_pointstamps, but my analysis was fairly inconclusive).

@frankmcsherry (Member) commented:

Looks good to me. There are a bunch of things still to do, but you have an issue open for them. I may dive in and hack on some bits here and there, but I think we need a bit of experience trying to use it before we make too many binding decisions.

@frankmcsherry frankmcsherry merged commit 8772012 into TimelyDataflow:master Jan 8, 2018