@rkenmi - String Manipulation in Python 3+

String Manipulation in Python 3+


Updated on March 3, 2018

There are a few ways to search for and manipulate strings in Python (3+):

String operations

str.find(sub[, start[, end]])

"shadow walker".find('w') => 5

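This gives you the index of the first match, or -1 if sub does not occur in the string.

While the index might be nice to have, do remember that strings are immutable: you cannot assign to that position, so any change means building a new string.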
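As a small sketch of how the optional start argument and the -1 result behave (the variable names here are just for illustration):

```python
text = "shadow walker"

first = text.find("w")               # index of the first "w"
second = text.find("w", first + 1)   # resume the search just past the first hit
missing = text.find("z")             # -1 signals "not found"

print(first, second, missing)        # 5 7 -1

# When you only care whether the substring exists, `in` reads better.
print("walk" in text)                # True
```

Because -1 is also a valid (negative) index in Python, it is worth checking for it explicitly before using the result to slice.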

str.replace(old, new[, count])

"ahoy m8y".replace("m8y", "matey") => "ahoy matey"

This allows you to replace every occurrence of the substring old with new. Since strings are immutable, the result is a new string and the original is left untouched.

Typically, you want to do this globally, but if you only want this to happen for the first \(n\) occurrences, then you'll need to pass n as the third argument, count.
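Here is a quick sketch of the difference between a global replace and a limited one. The sample string is made up for illustration, and count is passed positionally as the third argument:

```python
greeting = "ho ho ho"

all_replaced = greeting.replace("ho", "hey")     # every occurrence
first_two = greeting.replace("ho", "hey", 2)     # only the first 2 occurrences

print(all_replaced)   # hey hey hey
print(first_two)      # hey hey ho
print(greeting)       # ho ho ho (the original string is unchanged)
```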

str.split(sep=None, maxsplit=-1)

"a.b.c".split(".") => ['a','b','c']
"a.b.c".split(".", maxsplit=1) => ['a','b.c']

Use this when you want a list of the entries delimited by some separator.

Note that the delimiter will not be included in the result. For example, if you do "a.b.c".split("."), then expect the output to be ['a','b','c'] with no "." in sight.

maxsplit defines the maximum number of splits. The default -1 means unlimited splits. If maxsplit \(\geq\) 0, at most that many splits are made, and the rest of the string becomes the last element of the output list.
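One place maxsplit comes in handy is peeling a fixed number of fields off the front of a line while leaving the rest intact. The log-style record below is invented for this sketch:

```python
record = "2018-03-03 12:30:00 server restarted cleanly"

# Two splits give three pieces; the message keeps its inner spaces.
date, time, message = record.split(" ", maxsplit=2)
print(date)      # 2018-03-03
print(time)      # 12:30:00
print(message)   # server restarted cleanly

# With no separator at all, runs of whitespace count as one delimiter.
collapsed = "a  b".split()       # ['a', 'b']
explicit = "a  b".split(" ")     # ['a', '', 'b']
print(collapsed, explicit)
```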

str.splitlines([keepends])

"Hi\nHow are you?\nAre you okay?".splitlines() => ["Hi", "How are you?", "Are you okay?"]

This is super handy if you want to cut out line breaks. Simply call .splitlines with no argument provided. If you do want to keep the line breaks in the list result, set the flag keepends=True.
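To see what keepends changes, here is a short sketch using the same sample text:

```python
text = "Hi\nHow are you?\nAre you okay?"

without_breaks = text.splitlines()
with_breaks = text.splitlines(keepends=True)

print(without_breaks)   # ['Hi', 'How are you?', 'Are you okay?']
print(with_breaks)      # ['Hi\n', 'How are you?\n', 'Are you okay?']

# With keepends=True nothing is lost, so joining restores the original.
print("".join(with_breaks) == text)   # True
```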

str.strip([chars])

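"  A B C  ".strip() => "A B C"

A quick and easy way to cut out leading and trailing whitespace just by calling .strip(). Spaces in the middle of the string are left alone; to remove every space, use .replace(" ", "") instead.
The argument chars is similar to how you would put characters into a regular expression's square bracket. Each character in chars is considered for the stripping process, regardless of ordering, and characters are removed from both ends until one that is not in chars is reached.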

>>> "aaaeeeeddd".strip("de")
'aaa'  
>>> "aaaeeeeddd".strip("ed")
'aaa'  
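A short sketch of the related methods may help here: .strip() only trims the two ends of a string, and .lstrip() / .rstrip() trim one end each. The sample strings are made up for illustration:

```python
padded = "  A B C  "

both_ends = padded.strip()             # 'A B C'
left_only = padded.lstrip()            # 'A B C  '
right_only = padded.rstrip()           # '  A B C'
no_spaces = padded.replace(" ", "")    # 'ABC' (strip alone cannot do this)

# chars is a set of characters, not a prefix or suffix to match.
trimmed = "www.example.com".strip("w.moc")

print(repr(both_ends), repr(left_only), repr(right_only))
print(no_spaces)   # ABC
print(trimmed)     # example
```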

A more powerful toolset: Regex

The re module provides regular expressions. This gives you more flexibility and control over your string searches and manipulation. The downside is that the patterns can get complicated, but it's not all that bad.

re.search(pattern, string, flags=0)

import re
match = re.search(r"\S+", "Yoyoyo what is up?")

match.group() => 'Yoyoyo'

This method gives you the first occurrence of the pattern match. If you want every occurrence rather than just the first, then you'll want to look at re.findall. Returns None if no match is found.
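Since a failed search returns None, calling .group() on the result directly can blow up with an AttributeError. A minimal sketch of guarding against that, where first_word is a hypothetical helper written just for this example:

```python
import re

def first_word(text):
    """Return the first run of non-whitespace characters, or None."""
    match = re.search(r"\S+", text)
    if match is None:        # nothing matched, so there is no group to read
        return None
    return match.group()

print(first_word("Yoyoyo what is up?"))   # Yoyoyo
print(first_word("   "))                  # None
```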

match.group([group1, ...])

re.match(r"\w+ \w+", "Isaac Newton, physicist").group(0) => 'Isaac Newton'

Notice that this is a method of a different object, called Match. A Match object always evaluates to True in a Boolean context, whereas a failed search returns None. This means that if you get a Match object back from re.search(...), the pattern matched somewhere in the string (which is good). Keep in mind that re.match only matches at the start of the string, while re.search scans the whole string.

match.group() is one of the more common methods of the Match object. It can also be slightly confusing as to what results get returned, and what inputs are expected.

The easiest way to think about it is that the first result will always be the entire string match. No matter what. That is equivalent to match.group(0) or match.group(), since the argument to .group() takes in an index.

There is also a method called match.groups(), which returns a tuple of all the subgroups (the entire match is not included). match.group(n) for n \(\geq\) 1 is basically a lookup that retrieves the n-th entry of that tuple, i.e. match.groups()[n - 1].

If your regex does not have capture groups, match.groups() is empty and match.group(0) is the only value available.

If your regex has capture groups, the captured substrings will be added to match.groups(). Only then will you be able to access indices greater than 0 in match.group().

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'  
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'  
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'  
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
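To round this out, here is a sketch of how match.groups() lines up with match.group(), and what happens when the pattern has no capture groups at all:

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
subgroups = m.groups()
print(subgroups)                      # ('Isaac', 'Newton')
print(m.group(1) == subgroups[0])     # True: group(n) is groups()[n - 1]

# Same text, but the pattern has no parentheses.
plain = re.match(r"\w+ \w+", "Isaac Newton, physicist")
print(plain.group(0))                 # Isaac Newton
print(plain.groups())                 # ()

try:
    plain.group(1)                    # there is no subgroup 1 to look up
except IndexError as err:
    print("IndexError:", err)
```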

re.compile(pattern, flags=0)

h = re.compile('hello')

h.match('hello world')

This method is useful when you want to bind a regular expression pattern to a reusable variable. It can also save a little time when the same regular expression is used many, many times, although the re module already caches recently used patterns, so the gain is usually modest.
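A compiled pattern exposes the same operations as the module-level functions (search, match, findall, sub and so on) as methods. A small sketch of reusing one pattern across several strings, with sample lines invented for illustration:

```python
import re

adverb = re.compile(r"\w+ly")   # compile once, reuse below

lines = [
    "He was carefully disguised",
    "but captured quickly by police",
    "no adverbs here",
]

hits = [adverb.findall(line) for line in lines]
print(hits)   # [['carefully'], ['quickly'], []]
```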

re.findall(pattern, string, flags=0)

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

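This method is very useful for finding all non-overlapping occurrences of a pattern. It returns a list of strings, one per match. If the pattern contains a capture group, the list holds the captured group instead of the whole match, and with several groups each entry is a tuple.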
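What re.findall returns depends on how many capture groups the pattern has, which is easy to trip over. A sketch with a made-up key=value string:

```python
import re

text = "alice=1, bob=22, carol=333"

whole = re.findall(r"\w+=\d+", text)       # no groups: whole matches
names = re.findall(r"(\w+)=\d+", text)     # one group: just that group
pairs = re.findall(r"(\w+)=(\d+)", text)   # two groups: tuples

print(whole)   # ['alice=1', 'bob=22', 'carol=333']
print(names)   # ['alice', 'bob', 'carol']
print(pairs)   # [('alice', '1'), ('bob', '22'), ('carol', '333')]
```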

re.subn(pattern, repl, string, count=0, flags=0)

>>> re.subn(r'[\d]+', '$', 'ST0N3 0F J0RD4N')
('ST$N$ $F J$RD$N', 5)

>>> re.subn(r'[\s]{2,}', ' ', 'Fix    me up Scotty')
('Fix me up Scotty', 1)

This is basically the regex version of str.replace. This method is similar to re.sub, except that instead of just the new string it returns a tuple with the first value being the newly modified string, and the second value being the number of substitutions made. Use re.sub when you do not need the count.
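For comparison, here is a sketch of re.sub and re.subn side by side, along with the count argument that caps the number of substitutions. The input string is a variation on the example above:

```python
import re

text = "Fix    me   up    Scotty"

cleaned = re.sub(r"\s{2,}", " ", text)                 # just the new string
cleaned_too, n = re.subn(r"\s{2,}", " ", text)         # new string plus count
first_only = re.sub(r"\s{2,}", " ", text, count=1)     # stop after one

print(cleaned)          # Fix me up Scotty
print(cleaned_too, n)   # Fix me up Scotty 3
print(first_only)       # Fix me   up    Scotty
```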

