
6.10 Stealing from Python: the 'extract' method

Claude Roux edited this page Mar 3, 2021 · 11 revisions


Version française

For the neophyte that I was at the beginning of the 2000s, Python was a real discovery. Here was a language where compilation did not hinder the flow of thought in any way. The infinitely malleable code allowed a quick exploration of a problem. In a few moments, one could open a file, load data and modify it at will thanks to a concise syntax.

Very concise syntax, I must say.

Too concise perhaps...

The first code I examined contained the following line:

sub = s[-10:]

Hem!!!

At the time, the Internet was far from being an unlimited source of information, and despite my efforts and a careful reading of the documentation, this notation resisted my understanding. Of course, for the generations raised on Python since childhood, the idea that this simple instruction could be a problem may seem laughable.

To me, the idea that a negative number could occur in an index was inconceivable. All my experience as a C or Pascal programmer was screaming at me that it was wrong. A negative index could not exist.

Fortunately, the colleague who introduced me to Python quickly explained the idea behind it. A negative index was calculated from the end of the container.
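A quick way to convince yourself of this rule in Python is to compare the negative slice with its explicit, C-style equivalent. The sample string below is made up for the illustration:

```python
# Negative indexes count from the end: s[-k:] is the same slice as s[len(s)-k:].
s = "abcdefghijklmnopqrstuvwxyz"

last_ten = s[-10:]             # the last ten characters
same_slice = s[len(s) - 10:]   # the explicit equivalent, computed by hand

print(last_ten)                # qrstuvwxyz
print(last_ten == same_slice)  # True
```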

I had already encountered this notation in a previous project in APL, a language which I am sure was known to the members of the ABC team to which Van Rossum contributed.

s ← m[¯10;]

Alas for me, I had not made the connection at the time...

As a matter of fact, good ideas are rarely lost in computer science...

extract

In Lisp, such a concise notation is unfortunately inaccessible. The only solution is to implement an equivalent mechanism in the form of a function.

This function is called extract and it works exactly like the Python notation above...

(extract s -10 0)

Note that, unlike in Python, 0 is also treated as an index counted from the end: as the upper bound, it stands for the end of the string.

We could also write:

(extract s -10 (size s))

But execution is much slower in that case...
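To make the role of 0 concrete, here is a small Python model of these two calls. It is only a sketch of the behaviour described above, not the LispE implementation; the name extract_range is invented for the example.

```python
def extract_range(s: str, beg: int, end: int) -> str:
    """Hypothetical Python model of (extract s beg end) for numeric indexes.

    Assumption: an upper bound <= 0 is counted from the end of the string,
    so 0 means "up to the end" (a plain Python slice s[-10:0] is empty).
    """
    if beg < 0:
        beg = max(len(s) + beg, 0)
    if end <= 0:
        end = len(s) + end
    return s[beg:end]


s = "abcdefghijklmnopqrstuvwxyz"
print(extract_range(s, -10, 0))       # same characters as s[-10:]
print(extract_range(s, -10, len(s)))  # explicit upper bound, same result
print(repr(s[-10:0]))                 # plain Python slice: ''
```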

Sub-strings

But why stop there...

In fact, when manipulating strings of characters, extraction is often done by locating a sub-string from which the extraction is done.

Let's imagine that we have the following string: 123[4567]890 and that we want to extract the part between the brackets: [4567].

In most programming languages, the first step is to find the position of the '[', then the position of the ']', and finally extract our sub-string. Right?

# Note: There is certainly a more efficient way in Python that I don't know of...

s = "123[4567]890"
sub = ""
pbeg = s.find("[")
if pbeg != -1:
   pend = s.find("]")
   if pend > pbeg:
      sub = s[pbeg:pend+1]

If you want to exclude square brackets, you obviously have to manipulate the indexes:

sub = ""
pbeg = s.find("[")
if pbeg != -1:
   pend = s.find("]")
   if pend > pbeg:
      sub = s[pbeg+1:pend]

Note that at each step, we have to test the indexes to check that the elements we are looking for are present.

Of course, we could also use rfind if we are looking for the last occurrence in the string...
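The two variants can be folded into one small helper. This is only a sketch: the function name between and its from_end flag are invented for the example, and returning an empty string when a delimiter is missing is a choice made here.

```python
def between(s: str, left: str, right: str, from_end: bool = False) -> str:
    """Return the text between left and right, or "" if either is missing.

    With from_end=True the delimiters are located with rfind (last
    occurrence) instead of find (first occurrence).
    """
    find = s.rfind if from_end else s.find
    pbeg = find(left)
    if pbeg == -1:
        return ""
    pend = find(right)
    if pend <= pbeg:  # also covers pend == -1
        return ""
    return s[pbeg + len(left):pend]


s = "123[4567]890[ab]"
print(between(s, "[", "]"))                 # 4567
print(between(s, "[", "]", from_end=True))  # ab
```

Every call still hides two searches and two index tests, which is exactly the boilerplate the rest of this page tries to get rid of.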

Encapsulated search

After all, nothing prevents us from extending Python's notation so that the strings to search for can be used directly as indexes...

Not only does extract know how to use strings as an index, but it also allows us to specify whether or not these characters should be included in the final sub-string.

We can even play on the find/rfind distinction.

So you can write directly:

(setq s "123[4567]890")
(extract s "[" "]"); yields "4567".

Numerical index after the string

There are two cases when a numerical index is given after the string to search for:

  • Index > 0: the index is the number of characters to extract after the sub-string
  • Index <= 0: the index is a position calculated from the end of the string

(setq s "1234567890")
(extract s "5" 2); yields "67".
(extract s "5" -2); yields "678".
(extract s "5" 0); yields "67890".
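The two cases can be modelled in a few lines of Python. This is a sketch of the rule as stated above, not LispE's actual code; the name extract_after is invented, and returning an empty string when the search string is absent is an assumption.

```python
def extract_after(s: str, key: str, n: int) -> str:
    """Hypothetical model of (extract s key n): key is a string, n a number.

    n > 0  : take n characters after the first occurrence of key
    n <= 0 : take everything after key, up to position n counted from the end
    """
    pos = s.find(key)
    if pos == -1:
        return ""  # assumption: empty result when key is not found
    start = pos + len(key)
    if n > 0:
        return s[start:start + n]
    return s[start:len(s) + n]


s = "1234567890"
print(extract_after(s, "5", 2))   # 67
print(extract_after(s, "5", -2))  # 678
print(extract_after(s, "5", 0))   # 67890
```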

Keeping the characters or not: +.

By default, the following example returns the string 4567: the brackets are dropped.

(setq s "123[4567]890")
(extract s "[" "]"); yields "4567".

When the + operator is inserted before the string to search for, that string is kept in the final result:

(extract s +"[" +"]"); will give "[4567]".
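In Python terms, each + behaves like a flag that shifts one end of the slice. The sketch below models that idea; extract_between, keep_left and keep_right are names invented for the illustration, and searching for the closing string only after the opening one is an assumption of this model.

```python
def extract_between(s: str, left: str, right: str,
                    keep_left: bool = False, keep_right: bool = False) -> str:
    """Hypothetical model of (extract s left right) with optional + markers.

    keep_left / keep_right play the role of a + placed before each string.
    Returns "" when a delimiter is missing (an assumption of this sketch).
    """
    pbeg = s.find(left)
    if pbeg == -1:
        return ""
    pend = s.find(right, pbeg + len(left))
    if pend == -1:
        return ""
    start = pbeg if keep_left else pbeg + len(left)
    stop = pend + len(right) if keep_right else pend
    return s[start:stop]


s = "123[4567]890"
print(extract_between(s, "[", "]"))                                   # 4567
print(extract_between(s, "[", "]", keep_left=True, keep_right=True))  # [4567]
```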

Of course, the + operator can also be applied to an expression, provided that it evaluates to a string.

(extract s + (+ "[" "4") 0) ; will return [4567]890

Search from the end: -.

In the same way that a negative numerical index gives the position of a character counted from the end of the string, the - operator makes the search start from the end of the string.

(setq s "ab12ab34")
(extract s "ab" 0); yields "12ab34".

; with the operator -
(extract s - "ab" 0); yields "34".

Combining the two operators: -+.

The -+ operator lets you search for the sub-string from the end and keep it in the final result.

(setq s "ab12ab34")

; with the operator -
(extract s - "ab" 0); will return "34".

; with the -+ operator
(extract s -+"ab" 0); will return "ab34".
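The same results can be reproduced with a short Python model in which - becomes a switch from find to rfind and + becomes a flag that keeps the search string. As before, this is a sketch under assumptions: extract_from, from_end and keep are invented names, and only an upper bound <= 0 is handled.

```python
def extract_from(s: str, key: str, end: int = 0,
                 from_end: bool = False, keep: bool = False) -> str:
    """Hypothetical model of (extract s key 0) with the -, + and -+ markers.

    from_end=True stands for -  (locate key with rfind instead of find)
    keep=True     stands for +  (key stays in the result)
    end must be <= 0: a position counted from the end of the string.
    """
    pos = s.rfind(key) if from_end else s.find(key)
    if pos == -1:
        return ""  # assumption: empty result when key is not found
    start = pos if keep else pos + len(key)
    return s[start:len(s) + end]


s = "ab12ab34"
print(extract_from(s, "ab"))                            # 12ab34
print(extract_from(s, "ab", from_end=True))             # 34
print(extract_from(s, "ab", from_end=True, keep=True))  # ab34
```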

Conclusion

Maybe a full Wiki page is also a bit too much to describe an instruction as simple and obvious as extract... However, the possibility to consider sub-strings as indexes is incredibly useful.

So useful that when I implemented tamgu, another programming language of mine that took some inspiration from Python (but also from Haskell and Prolog), this was the very first notation I introduced.

//This is an example of tamgu code with string indexes in brackets:

string s = "qsjkqdkqs[123]qsdjkdsj";
println(s["[":"]"]);

I use LispE in many projects as a way to filter out textual data from JSON or XML files and this instruction is certainly the one I use the most...
