6.10 Stealing from Python: the 'extract' method
For the neophyte that I was at the beginning of the 2000s, Python was a real discovery. Here was a language in which no compilation step hindered the flow of thought. The infinitely malleable code allowed a quick exploration of a problem. In a few moments, one could open a file, load data and modify it at will thanks to a concise syntax.
Very concise syntax, I must say
Too concise perhaps...
The first code I examined contained the following line:
sub = s[-10:]
Hem!!!
At the time, the Internet was far from being an unlimited source of information, and despite my efforts and a careful reading of the documentation, this notation resisted my understanding. Of course, for the generations raised on Python since childhood, the idea that this simple instruction could be problematic may be a source of mockery.
To me, the idea that a negative number could occur in an index was inconceivable. All my experience as a C or Pascal programmer was screaming at me that it was wrong. A negative index could not exist.
Fortunately, the colleague who introduced me to Python quickly explained the idea behind it. A negative index was calculated from the end of the container.
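As a small Python illustration of that rule (the sample string is an assumption chosen for this example):

```python
s = "abcdefghijklmnop"

# A negative index counts from the end: -1 is the last character.
last = s[-1]

# s[-10:] keeps the last ten characters of the string.
sub = s[-10:]

# The same slice written with an explicit position computed from the length.
same = s[len(s) - 10:]

print(last, sub, same)
```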
I had already encountered this notation in a previous project in APL, a language which I am sure was known to the members of the ABC team to which Van Rossum contributed.
s ← m[¯10;]
Alas for me, I had not made the connection at the time...
As a matter of fact, good ideas are rarely lost in computer science...
In Lisp, such a concise notation is unfortunately inaccessible. The only solution is to implement an equivalent mechanism in the form of a function.
This function is called extract and it works exactly like the Python notation above...
(extract s -10 0)
Note that, unlike in Python, 0 counts as a negative index here: used as the end position, it denotes the end of the string.
We could also write:
(extract s -10 (size s))
But, the execution is much slower in that case...
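To make the role of 0 concrete, here is a minimal Python sketch of the numeric form. The name `extract_range` is hypothetical and only emulates the behaviour described above; it is not LispE's implementation.

```python
def extract_range(s: str, start: int, end: int) -> str:
    """Hypothetical emulation of (extract s start end) for numeric indexes."""
    n = len(s)
    if start < 0:
        # A negative start is counted from the end of the string.
        start = max(n + start, 0)
    if end <= 0:
        # Assumption taken from the text: 0 behaves like a negative index,
        # so an end of 0 means "the end of the string".
        end = n + end
    return s[start:end]


s = "abcdefghijklmnop"
tail = extract_range(s, -10, 0)           # same idea as (extract s -10 0)
explicit = extract_range(s, -10, len(s))  # same idea as (extract s -10 (size s))
plain_slice = s[-10:0]                    # plain Python: an empty string
print(repr(tail), repr(explicit), repr(plain_slice))
```

In plain Python, `s[-10:0]` is empty because 0 always means the start of the string, which is why the end has to be left out (`s[-10:]`).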
But why stop there...
In fact, when manipulating strings of characters, extraction is often done by locating a sub-string from which the extraction is done.
Let's imagine that we have the following string: `123[4567]890` and that we want to extract the part between the brackets: `[4567]`.
In most programming languages, the first step is to find the position of the '[', then the position of the ']', and then extract our sub-string. Right?
# Note: There is certainly a more efficient way in Python that I don't know of...
s = "123[4567]890"
sub = ""
pbeg = s.find("[")
if pbeg != -1:
    pend = s.find("]")
    if pend > pbeg:
        sub = s[pbeg:pend+1]  # "[4567]"
If you want to exclude square brackets, you obviously have to manipulate the indexes:
s = "123[4567]890"
sub = ""
pbeg = s.find("[")
if pbeg != -1:
    pend = s.find("]")
    if pend > pbeg:
        sub = s[pbeg+1:pend]  # "4567"
Note that at each step, we have to test the indexes to check that the elements we are looking for are present.
Of course, we could also use `rfind` if we are looking for the last occurrence in the string...
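A short Python sketch of the difference, using a string with two bracketed groups (the sample string is an assumption made for this example):

```python
s = "a[1]b[2]c"

# find returns the position of the first occurrence...
first_open = s.find("[")
first_group = s[first_open:s.find("]") + 1]

# ...while rfind returns the position of the last one.
last_open = s.rfind("[")
last_group = s[last_open:s.rfind("]") + 1]

print(first_group, last_group)
```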
After all nothing prevents us from extending the notation of Python to include directly the strings to search for...
Not only does extract know how to use strings as an index, but it also allows us to specify whether or not these characters should be included in the final sub-string.
We can even play on the `find`/`rfind` distinction.
So you can write directly:
(setq s "123[4567]890")
(extract s "[" "]"); yields "4567".
When a numerical index follows the search string, there are two cases:
- Index > 0: the index is the number of characters to extract after the sub-string
- Index <= 0: the index is a position calculated from the end of the string
(setq s "1234567890")
(extract s "5" 2); yields "67".
(extract s "5" -2); yields "678".
(extract s "5" 0); yields "67890".
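The two cases above can be sketched in Python as follows. The function name `extract_after` is hypothetical, and returning an empty string when the search string is absent is an assumption; the text does not say what LispE does in that case.

```python
def extract_after(s: str, key: str, index: int) -> str:
    """Hypothetical emulation of (extract s key index) with a numeric index."""
    pos = s.find(key)
    if pos == -1:
        # Assumption: nothing is extracted when the key is not found.
        return ""
    start = pos + len(key)  # extraction starts right after the key
    if index > 0:
        # Index > 0: number of characters to take after the key.
        return s[start:start + index]
    # Index <= 0: end position counted from the end of the string.
    return s[start:len(s) + index]


s = "1234567890"
print(extract_after(s, "5", 2))   # 67
print(extract_after(s, "5", -2))  # 678
print(extract_after(s, "5", 0))   # 67890
```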
By default, the following example returns the string `4567`: the brackets are lost.
(setq s "123[4567]890")
(extract s "[" "]"); yields "4567".
The `+` operator, when inserted before the search string, keeps that string in the final result:
(extract s +"[" +"]"); will give "[4567]".
Of course, the `+` operator can also be used with an evaluation, provided that it returns a string.
(extract s + (+ "[" "4") 0) ; will return "[4567]890"
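For readers who think in Python, the effect of `+` on a pair of search strings can be approximated with two boolean flags. This is only a sketch: `extract_between`, `keep_first` and `keep_last` are hypothetical names, and the empty string returned on failure is an assumption.

```python
def extract_between(s: str, first: str, last: str,
                    keep_first: bool = False, keep_last: bool = False) -> str:
    """Hypothetical emulation of (extract s first last), with flags for +."""
    pbeg = s.find(first)
    if pbeg == -1:
        return ""  # assumption: empty result when the first string is missing
    after_first = pbeg + len(first)
    # Look for the closing string only after the opening one.
    pend = s.find(last, after_first)
    if pend == -1:
        return ""  # assumption: empty result when the second string is missing
    start = pbeg if keep_first else after_first
    end = pend + len(last) if keep_last else pend
    return s[start:end]


s = "123[4567]890"
print(extract_between(s, "[", "]"))                                   # 4567
print(extract_between(s, "[", "]", keep_first=True, keep_last=True))  # [4567]
```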
### Search from the end: `-`
In the same way that a negative numerical index gives the position of a character counted from the end of the string, the `-` operator makes the search start from the end of the string.
```Lisp
(setq s "ab12ab34")
(extract s "ab" 0); yields "12ab34".
; with the operator -
(extract s - "ab" 0); yields "34".
```
The `-+` operator lets you search for the sub-string from the end and keep it in the final result.
(setq s "ab12ab34")
; with the operator -
(extract s - "ab" 0); will return "34".
; with the -+ operator
(extract s -+"ab" 0); will return "ab34".
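In Python terms, searching from the end corresponds to `rfind`. The sketch below mimics the two calls above; `extract_from_last` and its `keep` flag are hypothetical names, and the empty result for a missing string is an assumption.

```python
def extract_from_last(s: str, key: str, keep: bool = False) -> str:
    """Hypothetical emulation of (extract s - key 0) and (extract s -+ key 0)."""
    pos = s.rfind(key)  # search from the end of the string
    if pos == -1:
        return ""  # assumption: empty result when the key is not found
    # keep=True plays the role of -+ and keeps the key in the result.
    start = pos if keep else pos + len(key)
    return s[start:]


s = "ab12ab34"
print(extract_from_last(s, "ab"))             # 34
print(extract_from_last(s, "ab", keep=True))  # ab34
```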
Maybe a full Wiki page is also a bit too much to describe an instruction as simple and obvious as extract... However, the possibility to consider sub-strings as indexes is incredibly useful.
So useful that when I implemented tamgu, another programming language of mine, which took some inspiration from Python (but also from Haskell and Prolog), this was the very first notation that I introduced.
//This is an example of tamgu code with string indexes in brackets:
string s = "qsjkqdkqs[123]qsdjkdsj";
println(s["[":"]"]);
I use LispE in many projects as a way to filter out textual data from JSON or XML files and this instruction is certainly the one I use the most...