Extract information between square brackets for line in text file starting with >

I have many text files which look like this:

>CAA97360; SPAC26F1.03 [SCHPO]

M-----FRTCTKIGTVPKVLVNQKGLIDGLRRVTTDATTSRANPAHVPEEHDKPFPVKLD

DSVFEGYKIDVPSTEIEVTKGELLGLYEKMVTIRRLELACDALYKAKKIRGFCHLSIGQE

I want to extract the information between the square brackets on the header line (the line starting with >) and retain the sequence lines (M-----FRT etc.) below it. So I want the text to look like this:

>SCHPO

M-----FRTCTKIGTVPKVLVNQKGLIDGLRRVTTDATTSRANPAHVPEEHDKPFPVKLD

DSVFEGYKIDVPSTEIEVTKGELLGLYEKMVTIRRLELACDALYKAKKIRGFCHLSIGQE

How could I do this using the terminal please?

asked Nov 20 '18 at 17:50

Katie_C94

If it has to be bash scripting, I suggest you look into using AWK. There's a good tutorial here: cyberciti.biz/faq/bash-scripting-using-awk

– ahota
Nov 20 '18 at 18:00

bash command-line terminal

1 Answer

With AWK, try something like:

gawk '{print gensub(/^>.*\[(.+)\]/, ">\\1", 1)}' text

which outputs:

>SCHPO

M-----FRTCTKIGTVPKVLVNQKGLIDGLRRVTTDATTSRANPAHVPEEHDKPFPVKLD

DSVFEGYKIDVPSTEIEVTKGELLGLYEKMVTIRRLELACDALYKAKKIRGFCHLSIGQE

Explanations:

The gawk function gensub() searches the target string (by default $0, the current line) for the regular expression (1st argument) and replaces the matched string with the 2nd argument. The 3rd argument, 1 here, tells it to replace only the first match. (Note that this is only a rough overview of gensub(). See the gawk manual for detailed explanations.)

The regular expression /^>.*\[(.+)\]/ matches a line which starts with '>', followed by some characters, and then a substring surrounded by square brackets. The square brackets are escaped with backslashes so that they are matched literally.
Pay attention to the parentheses around the pattern within the square brackets.

As for the 2nd argument, \\1 (the leftmost backslash just escapes the next one) refers to the 1st parenthesized expression in the regular expression above.
This is called a back reference, and it lets you reuse the matched substring (the information between the square brackets in this case).
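The back-reference mechanism is not specific to gawk. As a rough sketch, assuming a POSIX sed and at most one bracketed tag per header line, the same capture-and-reuse idea might look like this (relabel_with_sed is a hypothetical name, not something from the answer above):

```shell
# Hypothetical helper: rewrite header lines so that only the bracketed tag
# remains, e.g. ">CAA97360; SPAC26F1.03 [SCHPO]" becomes ">SCHPO".
# Lines that do not start with ">" are printed unchanged.
# Reads the files given as arguments, or standard input if there are none.
relabel_with_sed() {
    # \( ... \) captures the text between the brackets; \1 reuses it
    # in the replacement, just like \\1 in the gensub() call.
    sed 's/^>.*\[\([^]]*\)].*$/>\1/' "$@"
}
```

Called as relabel_with_sed text, it prints the converted file to standard output and leaves the input file untouched.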

If the pattern matches, gensub() returns the modified string. Otherwise it returns the original string unchanged. So a plain print gensub(...) works for both matched and unmatched lines.
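Note that gensub() is a gawk extension, so the command above will not run under other awk implementations. Since the question mentions many files, here is a rough sketch of a portable variant that uses the POSIX match() and substr() functions and loops over several files. It rests on assumptions: relabel_all and the output directory layout are made up for this example, and it keeps the first bracketed tag on a header line (the gensub() version, with its greedy .*, anchors on the last opening bracket instead).

```shell
# Hypothetical helper: relabel the headers of every file listed after the
# output directory, writing each result to OUTDIR under the same file name.
# Usage: relabel_all OUTDIR FILE...
relabel_all() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for f in "$@"; do
        # On ">" lines that contain "[...]", keep only the bracketed text.
        # Every other line falls through to the plain print unchanged.
        awk '/^>/ && match($0, /\[[^]]+\]/) {
                 $0 = ">" substr($0, RSTART + 1, RLENGTH - 2)
             }
             { print }' "$f" > "$outdir/$(basename "$f")" || return 1
    done
}
```

For example, relabel_all relabeled *.txt would write the converted copies into a directory named relabeled and leave the original files as they are.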

Hope this helps.

answered Nov 21 '18 at 0:06

tshiono

This worked perfectly, thank you so much. The explanation was also really helpful

– Katie_C94
Nov 26 '18 at 14:16

Good to know I could be helpful to you. I would appreciate it if you could take action following the instructions. BR.

– tshiono
Nov 27 '18 at 1:39
