Recommand · October 14, 2021

Replace strings in a list (using re.sub)

I am trying to strip the trailing part of each file name, including the extension, in a list of files. I would like to loop through the items (files) and remove the extensions. I don't know how to loop through the list properly, because re.sub requires a string as its third parameter, e.g. re.sub(pattern, repl, string, count=0, flags=0).

import re

file_lst = ['cats1.fa', 'cats2.fa', 'dog1.fa', 'dog2.fa']
file_lst_trimmed =[]

for file in file_lst:
    file_lst_trimmed = re.sub(r'1.fa', '', file)

The issue arising here is that re.sub expects a string and I want it to loop through a list of strings.

Thanks for any advice!

You can use a list comprehension to construct the new list with the cleaned-up file names. \d is the regex token that matches a single digit, \. matches a literal period, and $ only matches at the end of the string.

file_lst_trimmed = [re.sub(r'\d\.fa$', '', file) for file in file_lst]

The results:

>>> file_lst_trimmed 
['cats', 'cats', 'dog', 'dog']

You can try this:

import re
file_lst = ['cats1.fa', 'cats2.fa', 'dog1.fa', 'dog2.fa']
final_list = [re.sub(r'\d+\.\w+$', '', i) for i in file_lst]


['cats', 'cats', 'dog', 'dog']
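
The two patterns shown so far give the same result on the sample list, but they diverge once a name ends in more than one digit or uses a different extension. A small sketch of the difference, using made-up file names (cats10.fa and dog3.fasta are hypothetical, not from the question):

```python
import re

# Hypothetical names, chosen only to show where the two patterns diverge.
names = ['cats1.fa', 'cats10.fa', 'dog3.fasta']

# Exactly one digit followed by the literal extension ".fa" at the end.
single_digit = [re.sub(r'\d\.fa$', '', n) for n in names]

# One or more digits followed by any extension made of word characters.
any_digits_any_ext = [re.sub(r'\d+\.\w+$', '', n) for n in names]

print(single_digit)        # ['cats', 'cats1', 'dog3.fasta']
print(any_digits_any_ext)  # ['cats', 'cats', 'dog']
```

Which one is preferable depends on how strict the match should be: the first leaves anything that is not a single digit plus .fa untouched, while the second strips any numeric suffix and any extension.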

I prefer Python's built-in functions over importing and using a library where possible. Using regex for such a simple task might not be the best way to do it; the approach below looks cleaner.

Try this

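file_lst = ['cats1.fa', 'cats2.fa', 'dog1.fa', 'dog2.fa']
file_lst_trimmed = []
for file in file_lst:
    # The loop body is missing from the original post; the line below is an
    # assumed completion that uses only built-in string methods.
    file_lst_trimmed.append(file.rsplit('.', 1)[0].rstrip('0123456789'))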

No need for regex; use os.path.splitext from the standard library for this.

Split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period. Leading periods on the basename are ignored; splitext('.cshrc') returns ('.cshrc', '').

import os.path

l = ['hello.fa', 'images/hello.png']

[os.path.splitext(filename)[0] for filename in l]


['hello', 'images/hello']
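
Note that splitext only removes the extension, so on the question's list it would leave the trailing digit in place ('cats1' rather than 'cats'). One possible way to drop the digits as well, still without regex, is to follow it with str.rstrip. This is a sketch that goes beyond the answer as posted:

```python
import os.path
import string

file_lst = ['cats1.fa', 'cats2.fa', 'dog1.fa', 'dog2.fa']

# splitext(...)[0] drops the extension; rstrip(string.digits) then removes
# any run of trailing digits from what is left.
file_lst_trimmed = [os.path.splitext(f)[0].rstrip(string.digits) for f in file_lst]

print(file_lst_trimmed)  # ['cats', 'cats', 'dog', 'dog']
```

Be aware that rstrip removes every trailing digit, so a name such as 'run2021.fa' would become 'run'.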

Your loop is actually perfectly fine! There are two other issues.

  1. You’re setting file_lst_trimmed equal to your string every iteration of the loop. You want to use append as in file_lst_trimmed.append("apple").

  2. Your regular expression is '1.fa', which only matches names ending in 1 (so cats2.fa is left untouched). To strip just the extension it should be '\.fa' (assuming you only want to strip .fa extensions).

EDIT: I now see that you also want to remove the trailing number. In that case, you'll want r'\d+\.fa' (\d stands for any digit 0-9, and \d+ means a run of one or more digits, so this will remove 10, 11, 13254, etc. The \ before the . is needed because . is a special character that otherwise matches any character). If you want to remove arbitrary file extensions, put \w+ in place of fa: it matches a run of one or more letters, digits, or underscores. You might want to check out the documentation for Python's re module.
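
Putting both fixes together, the loop from the question could look like the sketch below. The $ anchor, which restricts the match to the end of the name, is an extra precaution added here:

```python
import re

file_lst = ['cats1.fa', 'cats2.fa', 'dog1.fa', 'dog2.fa']
file_lst_trimmed = []

for file in file_lst:
    # Append the cleaned name instead of overwriting the list variable.
    file_lst_trimmed.append(re.sub(r'\d+\.fa$', '', file))

print(file_lst_trimmed)  # ['cats', 'cats', 'dog', 'dog']
```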