feat: create SymbolIterator for block parsing #106

Merged
merged 45 commits into from
Oct 2, 2023
Commits
a2ca1d2
feat: create SymbolIterator
mhatzl Sep 21, 2023
998d291
feat: switch block parser to SymbolIterator
mhatzl Sep 21, 2023
f1dc373
feat: add itertools for SymbolIterator
mhatzl Sep 22, 2023
de52811
feat: switch to nesting symbol iterators
mhatzl Sep 22, 2023
5398be0
fix: add prefix line test for symbol iterator
mhatzl Sep 22, 2023
aba8224
feat: simplify iterator nesting parsers
nfejzic Sep 22, 2023
88c3064
Merge branch 'symbol-iterator' of https://github.com/Unimarkup/unimar…
mhatzl Sep 22, 2023
fbefb50
fix: correct heading end closure to detect heading
mhatzl Sep 22, 2023
cd608b3
fix: ignore newlines between elements
mhatzl Sep 22, 2023
32778c9
feat: make end-fn optional for new symbol iterator
mhatzl Sep 22, 2023
1a5c5b0
fix: change end fns to get SymboliterMatcher
mhatzl Sep 22, 2023
b8d430b
fix: remove new_line from SymbolIterRoot
mhatzl Sep 22, 2023
6ad4a8b
fix: remove remaining symbols from tokenize output
mhatzl Sep 23, 2023
c73286f
fix: correct prefix consumption for symbol iterator
mhatzl Sep 23, 2023
27d8d70
fix: fix endless loop in peeking_next()
mhatzl Sep 23, 2023
71171f3
fix: correct iterator length calculation
mhatzl Sep 23, 2023
57f5f72
fix: prevent plain from merging with newline token
mhatzl Sep 23, 2023
16c2a60
fix: implement rendering for whitespace inlines
mhatzl Sep 23, 2023
1df4d76
fix: add comment why reset_peek() is needed
mhatzl Sep 23, 2023
f7cbbf8
fix: update verbatim to work with symbol iterator
mhatzl Sep 23, 2023
6c3c28e
arch: split iterator into multiple files
mhatzl Sep 23, 2023
ee317d2
fix: add documentation for the symbol iterator
mhatzl Sep 24, 2023
0d2c225
feat: add nesting depth to symbol iterator
mhatzl Sep 24, 2023
dd903f5
fix: add EOI symbol to match end as empty line
mhatzl Sep 24, 2023
b74c089
fix: remove EOI symbol for lexer tests
mhatzl Sep 24, 2023
45f4a1f
fix: pin zerovec crate to specific version
mhatzl Sep 24, 2023
8487538
fix: resolve icu dependency problems
mhatzl Sep 24, 2023
e1751f5
feat: update icu to not need any generated data
mhatzl Sep 25, 2023
f31143b
fix: remove crate_authors!() due to clippy warning
mhatzl Sep 25, 2023
3746027
chore: remove lock file from vc after icu bump
mhatzl Sep 25, 2023
f8bab51
fix: add blankline for better readability
mhatzl Sep 29, 2023
b63b902
fix: use `debug_assert!()` instead of `cfg(debug_assertions)`
mhatzl Sep 29, 2023
0ad2063
fix: make peeking_next() more compact
mhatzl Sep 29, 2023
17e1956
fix: use owned Vec to create Paragraph from
mhatzl Sep 29, 2023
b20952f
fix: use `iter::once()` to create end sequence
mhatzl Sep 29, 2023
0dc18ad
fix: remove double dot at end of sentence
mhatzl Sep 29, 2023
85f46ff
fix: map length before unwrap of remaining_symbols
mhatzl Sep 29, 2023
6e12f23
fix: improve comments for SymbolIterator
mhatzl Sep 29, 2023
7235dfb
fix: remove Scanner struct
mhatzl Sep 29, 2023
02c4505
fix: restrict visibility of iterator index fns
mhatzl Sep 29, 2023
0d5c8ab
fix: remove duplicate From<> impls for iterators
mhatzl Sep 29, 2023
d489076
fix: remove *curr* prefix for iterator functions
mhatzl Sep 29, 2023
01a148b
fix: remove *curr* prefix from index in root iterator
mhatzl Sep 29, 2023
d710917
fix: add assert to ensure update done on act parent
mhatzl Sep 29, 2023
a69de7a
chore: merge branch 'main' into symbol-iterator
mhatzl Oct 1, 2023
1 change: 1 addition & 0 deletions commons/Cargo.toml
@@ -23,6 +23,7 @@ icu_segmenter = "1.3.0"
icu_locid = "1.3.0"
regex = { version = "1.8.1", optional = true }
insta = { version = "1.29.0", features = ["serde"], optional = true }
itertools = "0.11.0"

[features]
test_runner = ["dep:regex", "dep:once_cell", "dep:insta"]
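
The `itertools` dependency added above is imported in `matcher.rs` for `PeekingNext` and `peeking_take_while`, which consume an item only if a predicate holds. As a rough, dependency-free sketch of that idea (not the PR's code), the standard library's `Peekable::next_if` behaves similarly for a single upcoming item; the helper name `skip_spaces` is hypothetical:

```rust
use std::iter::Peekable;

/// Counts leading spaces without consuming the first non-space character.
/// Hypothetical helper; it only illustrates conditional consumption.
fn skip_spaces<I: Iterator<Item = char>>(iter: &mut Peekable<I>) -> usize {
    let mut count = 0;
    // `next_if` advances only when the predicate holds for the upcoming item.
    while iter.next_if(|c| *c == ' ').is_some() {
        count += 1;
    }
    count
}

fn main() {
    let mut chars = "   # heading".chars().peekable();
    let skipped = skip_spaces(&mut chars);
    println!("skipped {skipped} spaces, next is {:?}", chars.peek());
}
```

Unlike this sketch, the iterator in the PR keeps a separate peek index, so several symbols can be inspected before deciding whether to consume them.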
125 changes: 59 additions & 66 deletions commons/src/scanner/mod.rs
@@ -1,81 +1,74 @@
//! Scanner and helper types and traits for structurization of Unimarkup input.
//! Functionality, iterators, helper types and traits to get [`Symbol`]s from `&str`.
//! These [`Symbol`]s and iterators are used to convert the input into a Unimarkup document.

use icu_segmenter::GraphemeClusterSegmenter;

pub mod position;
pub mod span;
mod symbol;

use icu_segmenter::GraphemeClusterSegmenter;
use position::{Offset, Position};
pub use symbol::{Symbol, SymbolKind};

#[derive(Debug)]
pub struct Scanner {
segmenter: GraphemeClusterSegmenter,
}

impl Clone for Scanner {
fn clone(&self) -> Self {
let segmenter = GraphemeClusterSegmenter::new();

Self { segmenter }
}
}

impl Default for Scanner {
fn default() -> Self {
let segmenter = GraphemeClusterSegmenter::new();
use position::{Offset, Position as SymPos};
pub use symbol::{iterator::*, Symbol, SymbolKind};
Collaborator

It could be a good idea to rename Position to SymPos in general, since that's what it actually is 🤔

Contributor Author

I think SymbolPosition would be better in that case, and I would also change Offset to SymbolOffset.

Collaborator

SymbolPosition is looooong 😆. It does read better though. Both options are fine for me, you're free to choose whatever you find better 👍🏻.

P.S. if you can't choose, then choose randomly 🤣

Contributor Author

Or we keep the names and move them into the symbol module?
I was thinking about this option, but then scanner becomes a bit useless?
But removing scanner by moving symbol up did not seem right to me.


Self { segmenter }
}
}

impl Scanner {
pub fn scan_str<'s>(&self, input: &'s str) -> Vec<Symbol<'s>> {
let mut symbols: Vec<Symbol> = Vec::new();
let mut curr_pos: Position = Position::default();
let mut prev_offset = 0;
/// Scans given input and returns vector of [`Symbol`]s needed to convert the input to Unimarkup content.
pub fn scan_str(input: &str) -> Vec<Symbol<'_>> {
let segmenter = GraphemeClusterSegmenter::new();

// skip(1) to ignore break at start of input
for offset in self.segmenter.segment_str(input).skip(1) {
if let Some(grapheme) = input.get(prev_offset..offset) {
let mut kind = SymbolKind::from(grapheme);
let mut symbols: Vec<Symbol> = Vec::new();
let mut curr_pos: SymPos = SymPos::default();
let mut prev_offset = 0;

let end_pos = if kind == SymbolKind::Newline {
Position {
line: (curr_pos.line + 1),
..Default::default()
}
} else {
Position {
line: curr_pos.line,
col_utf8: (curr_pos.col_utf8 + grapheme.len()),
col_utf16: (curr_pos.col_utf16 + grapheme.encode_utf16().count()),
col_grapheme: (curr_pos.col_grapheme + 1),
}
};
// skip(1) to ignore break at start of input
for offset in segmenter.segment_str(input).skip(1) {
if let Some(grapheme) = input.get(prev_offset..offset) {
let mut kind = SymbolKind::from(grapheme);

if curr_pos.col_utf8 == 1 && kind == SymbolKind::Newline {
// newline at the start of line -> Blankline
kind = SymbolKind::Blankline;
let end_pos = if kind == SymbolKind::Newline {
SymPos {
line: (curr_pos.line + 1),
..Default::default()
}
} else {
SymPos {
line: curr_pos.line,
col_utf8: (curr_pos.col_utf8 + grapheme.len()),
col_utf16: (curr_pos.col_utf16 + grapheme.encode_utf16().count()),
col_grapheme: (curr_pos.col_grapheme + 1),
}
};

symbols.push(Symbol {
input,
kind,
offset: Offset {
start: prev_offset,
end: offset,
},
start: curr_pos,
end: end_pos,
});

curr_pos = end_pos;
if curr_pos.col_utf8 == 1 && kind == SymbolKind::Newline {
// newline at the start of line -> Blankline
kind = SymbolKind::Blankline;
}
prev_offset = offset;
}

// last offset not needed, because break at EOI is always available
symbols
symbols.push(Symbol {
input,
kind,
offset: Offset {
start: prev_offset,
end: offset,
},
start: curr_pos,
end: end_pos,
});

curr_pos = end_pos;
}
prev_offset = offset;
}

symbols.push(Symbol {
input,
kind: SymbolKind::EOI,
offset: Offset {
start: prev_offset,
end: prev_offset,
},
start: curr_pos,
end: curr_pos,
});

// last offset not needed, because break at EOI is always available
symbols
}
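
The reworked `scan_str()` above tracks a running position per symbol and appends a zero-width `EOI` symbol at the end. A minimal sketch of that bookkeeping follows. It is not the PR's implementation: it splits on `char`s instead of grapheme clusters to stay dependency-free, tracks a single column counter, and assumes positions start at line 1, column 1. The names `Kind`, `Pos`, `Sym`, and `scan` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Plain,
    Newline,
    Eoi,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Pos {
    line: usize,
    col: usize,
}

#[derive(Debug, PartialEq, Eq)]
struct Sym {
    kind: Kind,
    start_offset: usize,
    end_offset: usize,
    start: Pos,
    end: Pos,
}

/// Simplified stand-in for `scan_str()`: one symbol per `char` (assumption).
fn scan(input: &str) -> Vec<Sym> {
    let mut symbols = Vec::new();
    let mut curr = Pos { line: 1, col: 1 };
    let mut prev_offset = 0;

    for (start_offset, c) in input.char_indices() {
        let end_offset = start_offset + c.len_utf8();
        let kind = if c == '\n' { Kind::Newline } else { Kind::Plain };
        // A newline ends at column 1 of the next line; anything else advances the column.
        let end = if kind == Kind::Newline {
            Pos { line: curr.line + 1, col: 1 }
        } else {
            Pos { line: curr.line, col: curr.col + 1 }
        };
        symbols.push(Sym { kind, start_offset, end_offset, start: curr, end });
        curr = end;
        prev_offset = end_offset;
    }

    // Zero-width end-of-input marker, so consumers can match the end like any other symbol.
    symbols.push(Sym {
        kind: Kind::Eoi,
        start_offset: prev_offset,
        end_offset: prev_offset,
        start: curr,
        end: curr,
    });
    symbols
}

fn main() {
    for sym in scan("a\nb") {
        println!("{sym:?}");
    }
}
```

Because the end marker is always pushed, even an empty input yields one symbol, which lets end matchers treat the end of input like an empty line.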
150 changes: 150 additions & 0 deletions commons/src/scanner/symbol/iterator/matcher.rs
@@ -0,0 +1,150 @@
//! Contains matcher traits and types used to detect iterator end and strip prefixes.
//! The available matcher traits are implemented for [`SymbolIterator`].

use std::rc::Rc;

use itertools::{Itertools, PeekingNext};

use crate::scanner::SymbolKind;

use super::SymbolIterator;

/// Function type to notify an iterator if an end was reached.
pub type IteratorEndFn = Rc<dyn (Fn(&mut dyn EndMatcher) -> bool)>;

/// Function type to consume prefix sequences of a new line.
pub type IteratorPrefixFn = Rc<dyn (Fn(&mut dyn PrefixMatcher) -> bool)>;

/// Trait containing functions that are available inside the end matcher function.
pub trait EndMatcher {
    /// Returns `true` if the upcoming [`Symbol`] sequence is an empty line.
    /// Meaning that a line contains no [`Symbol`] or only [`SymbolKind::Whitespace`].
    ///
    /// **Note:** This is also `true` if a parent iterator stripped non-whitespace symbols, and the nested iterator only has whitespace symbols.
    ///
    /// [`Symbol`]: super::Symbol
    fn is_empty_line(&mut self) -> bool;

    /// Wrapper around [`Self::is_empty_line()`] that additionally consumes the matched empty line.
    /// Consuming means the related iterator advances over the matched empty line.
    ///
    /// **Note:** The iterator is only advanced if an empty line is matched.
    ///
    /// **Note:** The empty line is **not** included in the symbols returned by [`SymbolIterator::take_to_end()`].
    fn consumed_is_empty_line(&mut self) -> bool;

    /// Returns `true` if the given [`Symbol`] sequence matches the upcoming one.
    ///
    /// [`Symbol`]: super::Symbol
    fn matches(&mut self, sequence: &[SymbolKind]) -> bool;

    /// Wrapper around [`Self::matches()`] that additionally consumes the matched sequence.
    /// Consuming means the related iterator advances over the matched sequence.
    ///
    /// **Note:** The iterator is only advanced if the sequence is matched.
    ///
    /// **Note:** The matched sequence is **not** included in the symbols returned by [`SymbolIterator::take_to_end()`].
    fn consumed_matches(&mut self, sequence: &[SymbolKind]) -> bool;

    /// Returns `true` if the iterator is at the given nesting depth.
    ///
    /// **Note:** Use [`SymbolIterator::depth()`] to get the current depth of an iterator.
    fn at_depth(&self, depth: usize) -> bool;
}

/// Trait containing functions that are available inside the prefix matcher function.
pub trait PrefixMatcher {
    /// Consumes and returns `true` if the given [`Symbol`] sequence matches the upcoming one.
    /// Consuming means the related iterator advances over the matched sequence.
    ///
    /// **Note:** The iterator is only advanced if the sequence is matched.
    ///
    /// **Note:** The given sequence must **not** include any [`SymbolKind::Newline`], because matches are only considered per line.
    ///
    /// **Note:** The matched sequence is **not** included in the symbols returned by [`SymbolIterator::take_to_end()`].
    ///
    /// [`Symbol`]: super::Symbol
    fn consumed_prefix(&mut self, sequence: &[SymbolKind]) -> bool;
}

impl<'input> EndMatcher for SymbolIterator<'input> {
    fn is_empty_line(&mut self) -> bool {
        // Note: Multiple matches may be set in the match closure, so we need to ensure that all start at the same index
Collaborator

This is an extreme nitpick, but I think the convention is to use upper case for NOTE, FIXME, TODO etc. Uppercase versions get highlighted (at least in my editor) 🙈.

You can decide to change this or leave it, just wanted to mention it 👀

Contributor Author

I did not know that about NOTE.
But would you also write NOTE in doc-comments?
Usually I write **Note:** in doc-comments, but because normal comments have no formatting, I stayed with Note here, because it felt more consistent.

I would have to look through the code to replace all occurrences of Note. Maybe keep it as is for now, and change them all in a new PR?

Collaborator

Hmm good question. Generally I wouldn't write any form of Note: in doc-comments. Would probably just explain it, something like Note that ....

You can decide if and when you want to change this; it's not an important part of this PR anyway.

Contributor Author

I think notes in doc-comments can be useful.
Think of them more like GitHub alerts.

It helps to highlight information that is especially relevant to a user.
Making a section bold does not make it readable, so you use alerts to highlight those sections instead.

Collaborator

I know, but just in general keep in mind that it's not necessary most of the time. If something is that important, maybe a separate heading is a better option. Otherwise we can just explain it. I also use NOTE: often, but we should probably reserve it for special cases. It kind of loses its purpose if we overuse it; I just want us to be aware of that.
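
For reference, the styles discussed in this thread might look as follows side by side. This is only an illustrative sketch, not code from the PR; both function names are hypothetical, and which editors highlight uppercase markers varies.

```rust
/// Returns `true` if `line` contains only spaces or tabs.
///
/// **Note:** An empty string also counts as an empty line.
fn is_empty_line_inline_note(line: &str) -> bool {
    // NOTE: uppercase marker in a regular comment, as suggested in the review.
    line.chars().all(|c| c == ' ' || c == '\t')
}

/// Returns `true` if `line` contains only spaces or tabs.
///
/// # Empty input
///
/// An empty string also counts as an empty line.
fn is_empty_line_heading(line: &str) -> bool {
    line.chars().all(|c| c == ' ' || c == '\t')
}

fn main() {
    println!("{}", is_empty_line_inline_note(" \t"));
    println!("{}", is_empty_line_heading("text"));
}
```

The first variant uses the bold inline note the author prefers in doc-comments; the second uses a separate heading, the alternative raised for information that is important enough to stand out.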

        self.reset_peek();

        let next = self
            .peeking_next(|s| {
                matches!(
                    s.kind,
                    SymbolKind::Newline | SymbolKind::Blankline | SymbolKind::EOI
                )
            })
            .map(|s| s.kind);

        let is_empty_line = if Some(SymbolKind::Newline) == next {
            let _whitespaces = self
                .peeking_take_while(|s| s.kind == SymbolKind::Whitespace)
                .count();

            let new_line = self.peeking_next(|s| {
                matches!(
                    s.kind,
                    SymbolKind::Newline | SymbolKind::Blankline | SymbolKind::EOI
                )
            });
            new_line.is_some()
        } else {
            next.is_some()
        };

        is_empty_line
    }

    fn consumed_is_empty_line(&mut self) -> bool {
        let is_empty_line = self.is_empty_line();

        if is_empty_line {
            self.set_index(self.peek_index()); // To consume peeked symbols
        }

        is_empty_line
    }

    fn matches(&mut self, sequence: &[SymbolKind]) -> bool {
        // Note: Multiple matches may be set in the match closure, so we need to ensure that all start at the same index
        self.reset_peek();

        for kind in sequence {
            if self.peeking_next(|s| s.kind == *kind).is_none() {
                return false;
            }
        }

        true
    }

    fn consumed_matches(&mut self, sequence: &[SymbolKind]) -> bool {
        let matched = self.matches(sequence);

        if matched {
            self.set_index(self.peek_index()); // To consume peeked symbols
        }

        matched
    }

    fn at_depth(&self, depth: usize) -> bool {
        self.depth() == depth
    }
}

impl<'input> PrefixMatcher for SymbolIterator<'input> {
    fn consumed_prefix(&mut self, sequence: &[SymbolKind]) -> bool {
        debug_assert!(
            !sequence.contains(&SymbolKind::Newline),
            "Newline symbol in prefix match is not allowed."
        );

        self.consumed_matches(sequence)
    }
}
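
To illustrate how an end function of the `IteratorEndFn` shape can drive an iterator, here is a reduced, self-contained sketch. It is an assumption-laden toy rather than the PR's `SymbolIterator`: `Kind`, `ToyIter`, and `EndFn` are hypothetical, the trait is cut down to two methods, and the peek state is a plain index instead of the `itertools` peeking API.

```rust
use std::rc::Rc;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Plain,
    Star,
}

/// Reduced version of the `EndMatcher` trait from the diff (assumption).
trait EndMatcher {
    fn matches(&mut self, sequence: &[Kind]) -> bool;
    fn consumed_matches(&mut self, sequence: &[Kind]) -> bool;
}

/// Same shape as `IteratorEndFn`, but over the reduced trait.
type EndFn = Rc<dyn Fn(&mut dyn EndMatcher) -> bool>;

/// Hypothetical iterator with a committed index and a separate peek index.
struct ToyIter<'a> {
    symbols: &'a [Kind],
    index: usize,
    peek_index: usize,
    end: Option<EndFn>,
}

impl EndMatcher for ToyIter<'_> {
    fn matches(&mut self, sequence: &[Kind]) -> bool {
        // Every match attempt starts from the committed index.
        self.peek_index = self.index;
        for kind in sequence {
            if self.symbols.get(self.peek_index) != Some(kind) {
                return false;
            }
            self.peek_index += 1;
        }
        true
    }

    fn consumed_matches(&mut self, sequence: &[Kind]) -> bool {
        let matched = self.matches(sequence);
        if matched {
            // Commit the peeked symbols so the iterator skips them.
            self.index = self.peek_index;
        }
        matched
    }
}

impl Iterator for ToyIter<'_> {
    type Item = Kind;

    fn next(&mut self) -> Option<Kind> {
        // Ask the end function first; it may consume the end sequence.
        if let Some(end) = self.end.clone() {
            if end(&mut *self) {
                return None;
            }
        }
        let symbol = self.symbols.get(self.index).copied()?;
        self.index += 1;
        Some(symbol)
    }
}

fn main() {
    let symbols = [Kind::Plain, Kind::Plain, Kind::Star, Kind::Star, Kind::Plain];
    let end: EndFn = Rc::new(|m: &mut dyn EndMatcher| m.consumed_matches(&[Kind::Star, Kind::Star]));
    let mut iter = ToyIter { symbols: &symbols, index: 0, peek_index: 0, end: Some(end) };
    let content: Vec<Kind> = iter.by_ref().collect();
    println!("{content:?}, continues at index {}", iter.index);
}
```

Resetting the peek index at the start of `matches` mirrors the `reset_peek()` calls in the diff: an end closure may try several sequences, and each attempt has to start from the same committed position.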