Interested in AWS.
I think making the consortium data (raw reads, BAM files, and consensus calls) available on AWS will be very useful for us. We have a custom NGS analysis pipeline and have already run it on the 1000 Genomes data on AWS. Having access to validated reference data that we can use to test our tools will be very valuable. I agree that downloading data locally for processing is a huge obstacle.
Yes to AWS data access
I will definitely use the AWS data (raw reads) to run our pipelines. I am not sure that SRA Toolkit access is suitable because of I/O (I will need more details). We will need to copy the data to AWS EBS volumes anyway in order to use it. In fact, rather than having these data files in S3, it would be more useful to have a snapshot of Linux-formatted EBS volumes containing the data, or even a snapshot of an NFS-mountable RAID that any user can instantiate and attach to a cluster (e.g. one launched with StarCluster) to run analyses from it.
I imagine other tools used in the project that need to run from the raw data (e.g. Cortex, lobSTR) could also benefit from running in AWS, because downloading data and moving it around is a major obstacle. In the long term, as reference data sets are established and used to validate new analytic pipelines, having access to them in an environment where the new tools can run for testing purposes would be very valuable. Raw read data may be used less than aligned BAMs or, later on, VCFs, so I would suggest monitoring usage and deciding on that basis when to remove these files.
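To make the snapshot suggestion above concrete, here is a minimal sketch of how a user might turn a shared EBS snapshot into a volume attached to their own instance. It assumes an object that behaves like a boto3 EC2 client; the function name, the default device name, and any snapshot or instance IDs are hypothetical, and nothing in this thread confirms that such a snapshot exists.

```python
def attach_volume_from_snapshot(ec2, snapshot_id, instance_id, zone, device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is assumed to behave like a boto3 EC2 client, e.g. boto3.client("ec2").
    `zone` must be the availability zone the instance runs in, because an
    EBS volume can only be attached to an instance in the same zone.
    """
    # Create a new volume whose contents come from the shared snapshot.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = volume["VolumeId"]

    # Block until the new volume is reported as available.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume; it still has to be mounted from inside the instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

With boto3 installed and credentials configured, this would be called with a real client plus placeholder IDs of your own. The returned volume ID can then be used to detach and delete the volume once the analysis is finished.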
Ashkenazi Jewish trio
Pete Estep from PGP asked whether any trios were of Ashkenazi Jewish origin. According to a report from the family, huAA53E0, hu8E87A9, and hu6E4515 are 100% Ashkenazi Jewish, and the parental (hu8E87A9 and hu6E4515) ancestry traces to Belarusian/Ukrainian/Lithuanian Jewish ghettos. They've also had another report of a mixed Ashkenazi trio: "My father [hu28DA07] and I [hu16360E] do have some Ashkenazi ancestry. He is probably about one half and I am about one quarter."
Also, one of the trio mothers (hu25DE85) has reported having Sephardic Jewish ancestry.
Steven Salzberg comment
Are any of the candidate genomes in the PGP set from people of Ashkenazi Jewish origin? That would be a good group to have, for a variety of reasons: genetic isolation for a long time until recently, a lot of genetics already known in this population, etc.
I agree with you, Marc. Not only would this show total transparency, but the consortium may also benefit from having "many eyes" looking at the data.