From stephenhuo at yahoo.com Mon Sep 8 10:58:32 2003
From: stephenhuo at yahoo.com (yongyang huo)
Date: Mon Sep 8 17:59:12 2003
Subject: [Discuss] the data set problem
Message-ID: <20030908165833.41002.qmail@web60103.mail.yahoo.com>

the data set problem:

Dear all,

I have downloaded the dataset "513_distribute.tar.gz" from the following link:

http://www.compbio.dundee.ac.uk/~www-jpred/data/

Inside the 513_distribute data, a file such as "1aazb-1-DOMAK.all" contains the following information:

RES: M,F,K,V,Y,G,Y,D,S,N,I,H,K,...
DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,...
DSSPACC:e,b,e,b,b,b,b,e,b,e,b,...
STRIDE:C,E,E,E,E,E,C,T,T,T,T,T,...
RsNo:1,2,3,4,5,6,7,8,9,10,11,12,...
DEFINE:E,E,E,E,E,E,_,_,_,_,_,_,_,...

I only made use of the first two lines, RES and DSSP:

RES: M,F,K,V,Y,G,Y,D,S,N,I,H,K,...
DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,...

I used the method proposed by Mohammed Ouali and Ross D. King, training it with the following conservative mapping: the H, I, and G states from DSSP are translated as alpha helix (H), E is translated as beta strand (E), and the remainder is translated as coil (C). Applying this mapping, the DSSP line

DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,...

is translated into:

DSSP;C;E;E;E;E;E;C;C;C;C;C;C;C;...

I then used the SNNS tool (a program that deals with neural networks) to read in the data with a sequence window size of 15 and analyse it, but the results are far from what I expected. That is why I am raising this question: is it correct to use the data file in this way, or should I use the DEFINE information rather than the DSSP information? What does DSSPACC mean, and what do the characters "e" and "b" on the DSSPACC line mean?

Could anyone tell me? Thank you very much!

Best Regards!
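The three-state reduction described in the message above can be sketched in a few lines of Python. This is a minimal sketch, not the Ouali and King code or the Jnet code: the function names `parse_all_record` and `reduce_dssp` are hypothetical, the file layout is assumed to be one `KEY:comma,separated,values` line per field as in the excerpt, and the mapping is the one stated in the message (H, I, G to helix; E to strand; everything else to coil).

```python
# Sketch only: the record layout (KEY:v1,v2,...) is assumed from the excerpt above.
HELIX = {"H", "I", "G"}   # mapping as stated in the message
STRAND = {"E"}


def parse_all_record(text):
    """Parse lines like 'DSSP:_,E,E' into {'DSSP': ['_', 'E', 'E']}."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, values = line.partition(":")
        fields[key.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return fields


def reduce_dssp(states):
    """Reduce DSSP states to three classes: H (helix), E (strand), C (coil)."""
    reduced = []
    for s in states:
        if s in HELIX:
            reduced.append("H")
        elif s in STRAND:
            reduced.append("E")
        else:
            reduced.append("C")
    return reduced


record = parse_all_record(
    "RES: M,F,K,V,Y,G,Y,D,S,N,I,H,K\n"
    "DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_\n"
)
three_state = reduce_dssp(record["DSSP"])
print(",".join(three_state))  # C,E,E,E,E,E,C,C,C,C,C,C,C
```

A different grouping can be tried by editing the `HELIX` and `STRAND` sets; everything not listed in either set falls through to coil.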
From jon at compbio.dundee.ac.uk Tue Sep 9 12:47:28 2003
From: jon at compbio.dundee.ac.uk (Jonathan Barber)
Date: Tue Sep 9 11:47:34 2003
Subject: [Discuss] the data set problem
In-Reply-To: <20030908165833.41002.qmail@web60103.mail.yahoo.com>
References: <20030908165833.41002.qmail@web60103.mail.yahoo.com>
Message-ID: <20030909104728.GB22569@flea.compbio.dundee.ac.uk>

On Mon, Sep 08, 2003 at 09:58:32AM -0700, yongyang huo wrote:
> the data set problem:
> Dear all,
> i have downloaded the dataset "513_distribute.tar.gz" in the link
> following:
> http://www.compbio.dundee.ac.uk/~www-jpred/data/
> inside the 513_distribute data, for example, the file
> name:"1aazb-1-DOMAK.all", has got the information as follows:
> RES: M,F,K,V,Y,G,Y,D,S,N,I,H,K,...
> DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,...
> DSSPACC:e,b,e,b,b,b,b,e,b,e,b,...
> STRIDE:C,E,E,E,E,E,C,T,T,T,T,T,...
> RsNo:1,2,3,4,5,6,7,8,9,10,11,12,...
> DEFINE:E,E,E,E,E,E,_,_,_,_,_,_,_,...
> i just make used of the first two line: RES and DSSP,
> RES: M,F,K,V,Y,G,Y,D,S,N,I,H,K,...
> DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,...
> I used the method that proposed by Mohammed Ouali AND ross D.King,
> through the following conservative mapping to train the method: H, I,
> and G states from DSSP are translated as alpha helix (H), E is
> translated as Beta-strands (E), and the remainder is translated as
> coil (C). and apply this method in DSSP:
> DSSP:_,E,E,E,E,E,_,_,T,T,T,S,_,...
> could be translated into:
> DSSP;C;E;E;E;E;E;C;C;C;C;C;C;C;...
> and then i used SNNS tool(a program deal with the Neural Network) to
> read in the data,with the sequence window size 15, and analysis, but
> the result is far from what i expected, that's why i raised this

The window size may need to be expanded; Jnet uses a window size of 17 in the first network and 19 in the second.

Note that 2 networks are used.
The first has inputs for 17 residues, and three outputs, with prediction values for each state. The second network then has inputs for the first network's output, and has three outputs.

> question: is it correct that i make use of the data file in this way,
> or should i used the DEFINE information rather than the DSSP
> information?? what is the DSSPACC means? and what's the meaning of the
> DSSPACC line's character "e"and "b"?? Could anyone tell me, thank you

This refers to whether the residue is buried or exposed. I believe that greater than 25% relative solvent accessibility was defined as exposed.

In the Jnet papers, DSSP was found to be the best definition of secondary structure (at least for training for secondary structure prediction), with reductions to a three-state definition of:

strand = { B, E }
helix = { G }
coil = { H, I, S, T, _ }

Define and Stride are both other methods of assigning secondary structure.

> very much!!
> Best Regards!

--
Jon

From stephenhuo at yahoo.com Sun Sep 14 18:24:58 2003
From: stephenhuo at yahoo.com (yongyang huo)
Date: Mon Sep 15 01:25:40 2003
Subject: [Discuss] the data set problem
In-Reply-To: <20030909104728.GB22569@flea.compbio.dundee.ac.uk>
Message-ID: <20030915002458.23236.qmail@web60105.mail.yahoo.com>

Dear Mr Jonathan Barber,

Thank you very much for your information! But there is still one thing I don't understand, namely the statement "Note that 2 networks are used. The first has inputs for 17 residues, and three outputs, with prediction values for each state. The second network then has inputs for the first network's output, and has three outputs." Does that mean that these two neural networks have no hidden layer? How is the second layer built up? I mean, in the first one, the output is the prediction value for the middle residue of the 17-residue window, but in the second layer, what is its input?
As the input of the first network is a 17-residue sequence, what does the 19-residue input window of the second network mean? Could you give a simple example? Thank you very much!

From geoff at compbio.dundee.ac.uk Mon Sep 15 10:19:46 2003
From: geoff at compbio.dundee.ac.uk (Geoff Barton)
Date: Mon Sep 15 09:20:39 2003
Subject: [Discuss] the data set problem
In-Reply-To: <20030915002458.23236.qmail@web60105.mail.yahoo.com>
References: <20030915002458.23236.qmail@web60105.mail.yahoo.com>
Message-ID:

Dear yongyang huo,

Thanks for your continued interest in our work. Most of the answers to the questions you ask are in the publications I have referred you to, or are explained in the earlier work by Rost and Sander. You can also download the JNet prediction program from our site and look at the source code yourself. I strongly suggest that you read that work and the associated literature carefully; then, if there are still things that are unclear, get back to us.

With best regards,

Geoff.
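For readers following the question about how the two networks connect, here is a rough sketch of the data flow only. It assumes details that the thread does not state: untrained random weights, one hidden layer of arbitrary size in each network, 21 input units per residue in the first network, zero-valued padding at the ends for the second network, and a second network that sees a 19-residue window of the first network's three outputs per residue. All names (`TinyNet`, `windows`, `HIDDEN`) are hypothetical; the JNet source code remains the authoritative reference.

```python
import math
import random

AMINO = "GAVLIPFYWSTCMNQDEKRH"   # 20 standard amino acids
UNITS = len(AMINO) + 1           # assumption: one extra unit marks padding
HIDDEN = 9                       # assumption: hidden layer size is arbitrary here


def one_hot(residue):
    """21-unit vector; residue=None means a padding position."""
    vec = [0.0] * UNITS
    vec[AMINO.index(residue) if residue is not None else UNITS - 1] = 1.0
    return vec


def windows(items, size, pad):
    """One window per item, centred on it, padded with `pad` at both ends."""
    half = size // 2
    padded = [pad] * half + list(items) + [pad] * half
    return [padded[i:i + size] for i in range(len(items))]


class TinyNet:
    """Untrained one-hidden-layer network with three softmax outputs."""

    def __init__(self, n_in, rng):
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(HIDDEN)]
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(3)]

    def predict(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        z = [sum(w * v for w, v in zip(row, h)) for row in self.w2]
        e = [math.exp(v) for v in z]
        return [v / sum(e) for v in e]


rng = random.Random(0)
seq = "FANGDPSKVSFRPSI"

# Network 1: 17 residues x 21 units in, 3 values out for the central residue.
net1 = TinyNet(17 * UNITS, rng)
stage1 = [net1.predict([u for r in w for u in one_hot(r)])
          for w in windows(seq, 17, None)]

# Network 2: 19 positions x 3 first-stage outputs in, 3 values out per residue.
net2 = TinyNet(19 * 3, rng)
stage2 = [net2.predict([u for triple in w for u in triple])
          for w in windows(stage1, 19, [0.0, 0.0, 0.0])]

print(len(stage1), len(stage2))  # one 3-value prediction per residue at each stage
```

The point of the sketch is the shapes: the second network never sees residues, only a window over the per-residue outputs of the first.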
------------------
Geoff Barton, Professor of Bioinformatics, School of Life Sciences
University of Dundee, Scotland, UK.
geoff@compbio.dundee.ac.uk Tel:+44 1382 345860/345843 (Fax:345764)
www.compbio.dundee.ac.uk

From stephenhuo at yahoo.com Fri Sep 26 17:55:48 2003
From: stephenhuo at yahoo.com (yongyang huo)
Date: Sat Sep 27 00:56:04 2003
Subject: [Discuss] Question about the 21 unit problem
Message-ID: <20030926235548.89175.qmail@web60103.mail.yahoo.com>

Dear all,

Could anyone explain why 21 units are used to represent each amino acid when building a neural network to predict protein secondary structure? The paper indicates that 20 bits are for the 20 amino acids (G; A; V; L; I; P; F; Y; W; S; T; C; M; N; Q; D; E; K; R; H) and one is for a padding space, but I have no idea what that particular padding space means. For example, with a window of 15 residues and the sequence FANGDPSKVSFRPSI, in my opinion the sliding window cannot move any further, as there are only 15 residues inside the window; it can only predict the central amino acid, which is "K". The additional bit for the padding space seems to be of no use at all, and 20 units to represent each amino acid would be enough. Am I wrong in this understanding? Could anyone give a simple example showing how to make use of the additional unit?

Thank you very much!

Best regards!
stephen

From geoff at compbio.dundee.ac.uk Sun Sep 28 12:12:27 2003
From: geoff at compbio.dundee.ac.uk (Geoff Barton)
Date: Sun Sep 28 11:06:22 2003
Subject: [Discuss] Question about the 21 unit problem
In-Reply-To: <20030926235548.89175.qmail@web60103.mail.yahoo.com>
References: <20030926235548.89175.qmail@web60103.mail.yahoo.com>
Message-ID:

The 21st "amino acid" is there to deal with the ends.
Thus, if you have a window that overlaps either the N- or C-terminus of the protein, you need somehow to pad the inputs with dummy residues.

You are welcome to download the JNet code from http://www.compbio.dundee.ac.uk and see how it was done there.

Best regards,

Geoff.
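As a small illustration of the padding idea, the sketch below slides a 15-residue window along the example sequence from the question and fills positions beyond either terminus with a dummy symbol. It is an assumed, simplified encoding, not the JNet implementation: the `-` padding symbol and the function names `padded_windows` and `encode` are hypothetical.

```python
AMINO = "GAVLIPFYWSTCMNQDEKRH"   # the 20 amino acids listed in the question
PAD = "-"                        # hypothetical symbol for the 21st (padding) unit
ALPHABET = AMINO + PAD


def padded_windows(seq, size):
    """Return one window per residue, centred on it and padded at the termini."""
    half = size // 2
    padded = PAD * half + seq + PAD * half
    return [padded[i:i + size] for i in range(len(seq))]


def encode(window):
    """One-hot encode a window as len(window) * 21 input units."""
    units = []
    for symbol in window:
        vec = [0] * len(ALPHABET)
        vec[ALPHABET.index(symbol)] = 1
        units.extend(vec)
    return units


seq = "FANGDPSKVSFRPSI"
wins = padded_windows(seq, 15)
print(wins[0])    # -------FANGDPSK  (window centred on the N-terminal F)
print(wins[7])    # FANGDPSKVSFRPSI  (window centred on K, no padding needed)
print(wins[-1])   # KVSFRPSI-------  (window centred on the C-terminal I)
print(len(encode(wins[0])))  # 315 = 15 positions * 21 units
```

With the extra unit, every residue of the 15-residue chain can sit at the window centre, so each one can receive a prediction rather than only the central K.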
From stephenhuo at yahoo.com Sun Sep 28 10:33:17 2003
From: stephenhuo at yahoo.com (yongyang huo)
Date: Sun Sep 28 17:33:26 2003
Subject: [Discuss] Question about the 21 unit problem
In-Reply-To:
Message-ID: <20030928163317.40727.qmail@web60109.mail.yahoo.com>

Dear Mr Barton,

Thanks a lot. Though I'm not familiar with C, I will try to read the source code; it may solve most of the problems that I raised ^_^

Regarding the N- or C-terminus of the protein you mentioned: for instance, in a protein that contains just the single amino acid sequence FANGDPSKVSFRPSI, does that mean the first residue F is the N-terminus and the last residue I is the C-terminus, so that if the sequence contains M (M>1) residues, the sliding window can move M steps? Originally I used just 20 units to represent each amino acid; therefore, if the window size is N, the sliding window positioned along the amino acid sequence could move M-N+1 steps, losing the secondary structure classification of the first N/2 residues and the last N/2 residues. Is that correct?

Thank you very much!

Best Regards!
From geoff at compbio.dundee.ac.uk Sun Sep 28 22:02:47 2003
From: geoff at compbio.dundee.ac.uk (Geoff Barton)
Date: Sun Sep 28 20:56:41 2003
Subject: [Discuss] Question about the 21 unit problem
In-Reply-To: <20030928163317.40727.qmail@web60109.mail.yahoo.com>
References: <20030928163317.40727.qmail@web60109.mail.yahoo.com>
Message-ID:

I strongly recommend you read a basic textbook about protein structure. This will help you in understanding the terminology used and the issues regarding protein structure prediction. I recommend Branden & Tooze, "Introduction to Protein Structure", as a good starting point. See:

http://www.amazon.co.uk/exec/obidos/ASIN/0815323050/ref=sr_aps_books_1_1/026-8490319-8414053

Kind regards,

Geoff.
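The counting argument in yongyang huo's last message (M-N+1 window positions without padding, M with padding) can be checked numerically. The sketch below is only arithmetic on window positions, under the assumption of an odd window size N centred on the residue being predicted; the function name `predicted_residues` is hypothetical.

```python
def predicted_residues(m, n, padded):
    """Count residues of an M-residue chain that get a prediction with window N."""
    if padded:
        return m                   # every residue can sit at the window centre
    return max(m - n + 1, 0)       # only windows that fit entirely inside the chain


m, n = 15, 15                      # FANGDPSKVSFRPSI with a 15-residue window
print(predicted_residues(m, n, padded=False))  # 1
print(predicted_residues(m, n, padded=True))   # 15
print(n // 2)                      # 7 residues skipped at each end without padding
```

For an odd N, the unpadded case skips N // 2 residues at each terminus, which is the loss described in the message.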