我认为这可以满足您的需求。这不是Peter的回答那么简单,因为它使用Python的csv
模块来处理文件。它可能会被重写和简化,只是像他一样将文件视为纯文本,但这应该很容易。
import csv
import re
import sys
csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
for rowfields in csvdictreader:
keep = True
for field in targets:
value = rowfields[field]
if re.match(r'^DQB1\*\d\d$', value): # gene resolution too low?
keep = False
break # quit processing target fields
else: # reduce gene resolution if too high
# by only keeping first two alles if three are present
rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
r'DQB1*\1:\2', value)
if keep:
csvdictwriter.writerow(rowfields)
对我而言,最难的部分是确定您想做什么。