File Reading in Perl

File reading is an integral part of bioinformatics. Searching for relevant information in large files by hand is not practical, so we read them with a program. There are many ways to read a file, and every programming language offers its own. In bioinformatics, file reading is usually done with scripting languages such as Perl, Python or Ruby rather than heavier platforms such as Java or .NET. Scripting languages are popular because they are easy to learn, flexible, and need few prerequisites. I prefer Perl for my basic tasks. In this section I am going to share some basic code that I use for data parsing. Most of the code is in Perl, and it is easy to try yourself. Download Perl for your operating system from https://www.perl.org/get.html. Perl code can be written in plain Notepad, as long as the file is saved with a .pl extension rather than .txt. For easier coding and for opening large files, I recommend Notepad++ (https://notepad-plus-plus.org/download/v7.7.1.html) over plain Notepad.

I will start by reading a few lines of a simple VCF file. Suppose we want to read the following VCF file using Perl. It can be downloaded directly from here (Test.vcf).

 

Copy the following lines into Notepad++ and save the file as Test.vcf

....

....

##contig=<ID=chrY,length=59373566>
##reference=file:///newstorage/home/adnan/Genomics/Genomes/Indexes/star/human/genome.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chrM 146 . T C 6797.77 . AC=2;AF=1.00;AN=2;DP=134;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.24;SOR=0.886 GT:AD:DP:GQ:PL 1/1:0,134:134:99:6826,469,0
chrM 150 . T C 6657.77 . AC=2;AF=1.00;AN=2;DP=130;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.63;SOR=1.127 GT:AD:DP:GQ:PL 1/1:0,130:130:99:6686,422,0
chrM 152 . T C 5780.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.183;ClippingRankSum=0.000;DP=131;ExcessHet=3.0103;FS=3.849;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=29.09;ReadPosRankSum=1.740;SOR=0.849 GT:AD:DP:GQ:PL 1/1:1,130:131:99:5809,353,0
chrM 195 . C T 4560.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=3.317;ClippingRankSum=0.000;DP=117;ExcessHet=3.0103;FS=2.994;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=32.93;ReadPosRankSum=2.438;SOR=0.150 GT:AD:DP:GQ:PL 1/1:2,115:117:99:4589,282,0
chrM 302 . AC A 706.73 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.764;ClippingRankSum=0.000;DP=137;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=25.24;ReadPosRankSum=2.234;SOR=4.407 GT:AD:DP:GQ:PL 1/1:2,26:28:32:744,32,0
chrM 410 . A T 5991.77 . AC=2;AF=1.00;AN=2;DP=148;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=30.05;SOR=5.283 GT:AD:DP:GQ:PL 1/1:0,148:148:99:6020,445,0
chrM 495 . AC A 1435.73 . AC=2;AF=1.00;AN=2;DP=241;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.91;SOR=1.714 GT:AD:DP:GQ:PL 1/1:0,45:45:99:1473,137,0

...

...

 

Perl code to open and read the file. Copy the following lines into Notepad++ and save the file as Test_1.pl, then run it from the command prompt with: perl Test_1.pl

use strict;
use warnings;

# $in is a filehandle; any name can be used. "Test.vcf" works if the file is
# in the same directory as the script; otherwise give the full path.
open(my $in, '<', 'Test.vcf') or die "Cannot open Test.vcf: $!";

# while repeats the block in braces for as long as there are lines left in the file
while (my $line = <$in>) {
    print $line;   # print the line just read to the cmd screen
    sleep(2);      # pause for 2 seconds so each line can be seen; without it the output scrolls past too quickly
}

close($in);
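Printing whole lines is only the first step. The sketch below shows one way the same loop could skip the header lines and pick out individual columns. The helper name parse_vcf_line and the field names in the hash are my own choices for illustration, not part of Test_1.pl. To stay self-contained it reads two lines copied from Test.vcf through an in-memory filehandle instead of opening the file.

```perl
use strict;
use warnings;

# Hypothetical helper: split one VCF line into a few named fields.
# Returns nothing for header lines (starting with '#') and blank lines.
sub parse_vcf_line {
    my ($line) = @_;
    return if $line =~ /^#/ || $line !~ /\S/;

    # VCF columns are tab-separated; splitting on any whitespace also
    # works for copies where the tabs were turned into spaces.
    my @cols = split ' ', $line;
    return {
        chrom => $cols[0],
        pos   => $cols[1],
        ref   => $cols[3],
        alt   => $cols[4],
        qual  => $cols[5],
    };
}

# Header line and first variant line from Test.vcf
my $sample = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1\n"
           . "chrM 146 . T C 6797.77 . AC=2;AF=1.00;AN=2;DP=134;ExcessHet=3.0103;"
           . "FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.24;SOR=0.886 "
           . "GT:AD:DP:GQ:PL 1/1:0,134:134:99:6826,469,0\n";

# Opening a reference to a string gives a filehandle over that string
open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
while (my $line = <$in>) {
    my $rec = parse_vcf_line($line) or next;   # skip header lines
    print join("\t", $rec->{chrom}, $rec->{pos}, $rec->{ref}, $rec->{alt}), "\n";
}
close($in);
```

To use it on the real file, replace \$sample in the open call with 'Test.vcf'.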

 

The Perl code file can be downloaded from here (Test_1.pl). In the next sections I will share my code for data manipulation. Once the data has been read, we can process it however we want and save it in any format we need. If you have any questions, please let me know through the contact form or leave a comment below.
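As a small taste of saving data in another format, here is a minimal sketch that writes the CHROM, POS, REF and ALT columns to a comma-separated file. The helper name vcf_to_csv and the output name Test.csv are assumptions made for this example only. The usage part runs only if Test.vcf is present in the current directory.

```perl
use strict;
use warnings;

# Hypothetical helper: copy CHROM, POS, REF and ALT from a VCF input
# handle to an output handle as comma-separated values.
# Returns the number of variant lines written.
sub vcf_to_csv {
    my ($in, $out) = @_;
    my $count = 0;
    print {$out} "CHROM,POS,REF,ALT\n";
    while (my $line = <$in>) {
        next if $line =~ /^#/;          # skip meta and header lines
        my @cols = split ' ', $line;    # split on whitespace (tabs or spaces)
        next if @cols < 5;              # skip blank or truncated lines
        print {$out} join(',', @cols[0, 1, 3, 4]), "\n";
        $count++;
    }
    return $count;
}

# Usage, assuming Test.vcf sits in the current directory
if (-e 'Test.vcf') {
    open(my $in,  '<', 'Test.vcf') or die "Cannot open Test.vcf: $!";
    open(my $out, '>', 'Test.csv') or die "Cannot write Test.csv: $!";
    my $n = vcf_to_csv($in, $out);
    close($in);
    close($out) or die "Cannot close Test.csv: $!";
    print "Wrote $n variants to Test.csv\n";
}
```

Note that the ALT column of a VCF file can itself contain commas when a site has more than one alternate allele, so for such files a tab or another separator would be a safer choice than a comma.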

 

 


© 2018 BioinfoGuide. All Rights Reserved.