



Searching Single Nucleotide Polymorphism Markers to Complex Diseases using Genetic Algorithm Framework and a BoostMode Support Vector Machine

Khantharat Anekboon, Suphakant Phimoltares, and Chidchanok Lursinsap
AVIC, Department of Mathematics, Chulalongkorn University, Bangkok, Thailand
Khantharat.A@Student.chula.ac.th, suphakant.p@chula.ac.th, and lchidcha@chula.ac.th

Sissades Tongsima
Genome Institute, National Center for Genetic Engineering and Biotechnology, Pathumtani, Thailand
sissades@biotec.or.th

Suthat Fucharoen
Thalassemia Research Center, Institute of Molecular Biosciences, Mahidol University, Salaya Campus, Nakhonpathom, Thailand
grsfc@mahidol.ac.th

Abstract

With the advent of large-scale, high-density single nucleotide polymorphism (SNP) arrays, case-control association studies have been performed to identify predisposing genetic factors that influence many common complex diseases. These genotyping platforms provide very dense SNP coverage on a single chip. Much research has focused on multivariate genetic models that identify genes able to predict disease status. However, increasing the number of SNPs generates a large number of combined genetic outcomes to be tested. This work presents a new algorithm for SNP analysis, called IFGA, that uses a "BoostMode" support vector machine (SVM) to select the set of SNP markers that best predicts the state of a complex disease. The proposed algorithm was applied to association studies of two diseases: Crohn's disease and the severity spectrum of β0/HbE thalassemia. Under 10-fold cross-validation, the selected SNPs classify thalassemia and Crohn's disease with 71.57% and 71.06% accuracy, respectively, which is higher than the accuracy of the optimum random forest (ORF) and classification and regression trees (CART) techniques.

Keywords: Single Nucleotide Polymorphism; Support Vector Machine; Genetic Algorithm

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

I. INTRODUCTION

Scientists have long been interested in identifying genetic factors that influence the occurrence of complex diseases. With the advent of parallel genotyping technology, the cost and time of finding SNPs are no longer prohibitive. Large case-control cohorts generated from very dense SNP arrays (DNA chips that contain a dense array of SNPs) challenge researchers to search for SNPs that are associated with the diseases. In contrast to single-gene disorders, a complex disease can be triggered by multiple genes upon exposure to certain environmental factors [1], [2]. However, searching for multiple-marker interactions in a large pool of SNPs imposes high computational and memory complexity.

Feature selection [3], the technique of selecting a subset of relevant features, has been widely used in almost all fields, including bioinformatics. By removing irrelevant or redundant features, it improves learning accuracy and helps reveal the importance of the remaining features.

II. THE PROPOSED IFGA METHOD

In this section, we introduce a new encoding method called IFGA. Fig. 1 summarizes the IFGA method. The first population is constructed by our proposed integer encoding approach. Each chromosome (in the Genetic Algorithm (GA) sense) represents a set of selected features. After the population is generated, each chromosome is evaluated by a fitness score, which is obtained with the BoostMode-SVM approach. The IFGA then generates the next population by IFGA selection, IFGA cross-over, and IFGA mutation until a termination criterion is satisfied.

Figure 1. The overall IFGA flowchart.

A. The Integer Encoding Method

Feature selection with a GA can be done by converting the input data using binary encoding [4]. The length of a binary chromosome equals the total number of features, so the size of the encoded chromosome corresponds directly to the number of input features. This presents a problem for two reasons. First, the running time depends heavily on the length of the chromosome. Second, a general binary encoding does not fix the number of selected features; it fixes only the length of the chromosome.

The IFGA integer encoding method is proposed to solve these problems. Assume that the case-control data used in this study have $m$ genotypes. Let $Q_i$ be the $i$-th chromosome processed by the algorithm. The length of $Q_i$, denoted by $|Q_i|$, is set to a constant less than or equal to $m$. Then $|Q_i|$ random numbers are drawn; they represent the locations of the genotypes to select from the given feature sequence. Within the IFGA, the chromosomes do not necessarily have identical lengths. For example, suppose $m = 7$, the chromosome size $|Q_i|$ is set to 3, and the randomly selected locations are 1, 5, and 6. Then the chromosome is $Q_i = (1, 5, 6)$.

B. IFGA Selection

Each chromosome is selected into a mating pool according to its fitness score by stochastic universal sampling (SUS) [5]. The IFGA also uses elitism [6], in which the best chromosome of the current generation is carried over into the next generation.

C. IFGA Cross-Over

The cross-over function of a traditional GA randomly selects a recombination point and swaps the segments of the two chromosomes flanking this point. This standard cross-over cannot be applied directly in the IFGA: it presumes that all chromosomes have the same size, and the IFGA additionally requires that the same feature (locus) not appear more than once on one chromosome. We therefore devise an IFGA cross-over technique.

Assume that parent1 and parent2 are the parental chromosomes, where each locus holds the position of a selected feature. At least one of the two parents must have more than one locus, and each parent must have at least one locus. The outputs of the algorithm are offspring1 and offspring2.

    x ← ∅
    y ← ∅
    tmp1 ← parent1
    for i = 1 to |parent1| do
        v ← |tmp1|
        sel ← random({1, 2, ..., v})
        x ← x ∪ {tmp1_sel}
        tmp1 ← tmp1 − {tmp1_sel}
    end for
    tmp2 ← parent2
    for i = 1 to |parent2| do
        v ← |tmp2|
        sel ← random({1, 2, ..., v})
        y ← y ∪ {tmp2_sel}
        tmp2 ← tmp2 − {tmp2_sel}
    end for
    c ← random(1, min(|parent1|, |parent2|) − 1)
    offspring1 ← x_1, x_2, ..., x_c, y_{c+1}, ..., y_{|parent2|}
    offspring2 ← y_1, y_2, ..., y_c, x_{c+1}, ..., x_{|parent1|}

D. IFGA Mutation

The mutation function alters the value at a specified locus. It occurs far less often than cross-over. Let $m$ denote the length of the given genotype sequence, let input_chrom be the chromosome to be mutated, and let output_chrom be the mutated chromosome. Each element of a chromosome is a selected feature.

    pos_out ← random(1, |input_chrom|)
    pos_in ← random(1, m)
    for i = 1 to |input_chrom| do
        if i = pos_out then
            output_chrom_i ← pos_in
        else
            output_chrom_i ← input_chrom_i
        end if
    end for

E. Generating a Population

There are two kinds of population: the initial population and the next-generation population. To generate the initial population of P chromosomes, where P is a user-defined population size, the algorithm repeatedly generates chromosomes by the integer encoding method and adds them to the population until it contains P chromosomes.

The next-generation population consists of the chromosome b, which has the best fitness score in the current generation; e groups of features produced by evolution (cross-over and mutation); and r groups of newly re-selected features. After b and e are added to the next generation, the chromosomes are checked for redundancy. Each chromosome in the next generation must be unique, so duplicated chromosomes are removed. If the next generation then contains fewer chromosomes than the current generation, new subsets of features, r, are randomly created and added to it.

F. Termination

The IFGA consists of a set of iterative steps: generating the population, evaluation by the BoostMode-SVM, IFGA selection, IFGA cross-over, and IFGA mutation. These steps are repeated until the best result remains unchanged for 300 consecutive iterations.

III. THE PROPOSED BOOSTMODE-SVM METHOD

The goal of an SVM [7] is to find a maximal-margin separating hyperplane, either for the linearly separable case (1) or for the nonlinearly separable case (2):

$$y_i = \operatorname{sign}(w^T x_i + b) \qquad (1)$$

$$y_i = \operatorname{sign}(w^T \phi(x_i) + b) \qquad (2)$$

Here $w^T$ is the transposed weight vector, $x_i$ is an input vector, $\phi$ is a mapping function, and $b$ is a bias value.

Both equations face the same problem when the input data are imbalanced. The separating hyperplane learned from an imbalanced dataset may shift too far toward the smaller group compared with the true separating hyperplane [8]. To solve this problem, the decision hyperplane should be adjusted. It can be seen from (1) and (2) that the parameter $w$ affects the classification output, so modifying $w$ adjusts the decision hyperplane, which may improve the classifier.

A. BoostMode-SVM

A new oversampling technique for nominal features is proposed to improve the performance of the SVM. The BoostMode-SVM (Fig. 1) builds two SVMs, SVM1 and SVM2. SVM1 is constructed to generate scores for the training dataset, whereas SVM2 is the final model used to classify the test set.

First, only the training set is used to construct SVM1 and to find the BoostMode. The BoostMode is the indicator vector of the minority dataset, and it is classified by SVM1. Two scoring methods, Unbiased Scoring (US) and Bias Scoring (BS), are proposed to compute the scoring values. The US method is used when SVM1 classifies the BoostMode correctly; otherwise the BS method is used. After that, a Scoring Over-Sampling approach (SOS) adds artificial data to the minority group by sampling from the minority group until both groups contain the same number of data points. In this paper, the minority group is the group that has fewer elements. SVM2 is then constructed from the original training dataset together with the new data produced by the SOS technique. Finally, the test set is run through SVM2 for evaluation. The error rate on the test set is the fitness score used in the IFGA described above.

B. Finding the BoostMode

To balance the size of the two classes, additional data must be generated for the minority group. The generating method (either US or BS) is chosen according to the BoostMode vector, which is computed as follows. Let nminor be the number of data points in the minority group. Bootstrap sampling with replacement is applied to the minority group to generate t datasets, BoostGroup_1, ..., BoostGroup_t. Each BoostGroup_i contains nminor data points.

    for i = 1 to t do
        allmode_i ← mode(BoostGroup_i)
    end for
    BoostMode ← mode_i(allmode_i)

C. The Unbiased Scoring Method

This technique is used when SVM1 classifies the BoostMode correctly. All data points have equal chances (equal scoring values) of being selected by the over-sampling technique. The following algorithm computes the scoring values; its output is scoreVal.

    for i = 1 to nminor do
        scoreVal_i ← 1 / nminor
    end for

D. The Bias Scoring Method

The BS technique is used when SVM1 classifies the BoostMode incorrectly. The scoring value of a data point is calculated from its distance to the decision hyperplane, by (3) for the linearly separable case or (4) for the nonlinearly separable case:

$$\mathit{distance}_i = w^T x_i + b \qquad (3)$$

$$\mathit{distance}_i = w^T \phi(x_i) + b \qquad (4)$$

A data point that is correctly classified has a smaller chance (a smaller scoring value) of being selected for over-sampling than one that is wrongly classified. Misclassified points are therefore more likely to be chosen as samples, and vice versa. The scoring value for the BS method is computed by the following algorithm. Let distance be the set of distances of all data points in the minority group. The output of the algorithm is the set scoreVal.

    sumSV1 ← 0
    minVal ← min_i(distance_i)
    addVal ← |minVal| + 1
    for i = 1 to nminor do
        tmp_i ← distance_i + addVal
        sumSV1 ← sumSV1 + tmp_i
    end for
    if the minority group is the control group then
        for i = 1 to nminor do
            tmp_i ← 2 [...] tmp_i
        end for
    end if
    for i = 1 to nminor do
        sumSV2 ← 0
        for j = 1 to i do
            sumSV2 ← sumSV2 + tmp_j
        end for
        scoreVal_i ← sumSV2 / sumSV1
    end for

E. The Scored Over-Sampling Method

The objective of the SOS algorithm is to select data from the minority group according to scoreVal, computed by either the US algorithm or the BS algorithm. Let MD_i denote the i-th data point of the minority group, for 1 ≤ i ≤ nminor. The numbers of data points in the minority and majority groups are nminor and nmajor, respectively. The output of the algorithm is samp_data, the set of additional data added to the minority group.

    z ← nmajor − nminor
    sumSV1 ← 0
    for i = 1 to |scoreVal| do
        sumSV1 ← sumSV1 + scoreVal_i
    end for
    for i = 1 to |scoreVal| do
        sumSV2 ← 0
        for j = 1 to i do
            sumSV2 ← sumSV2 + scoreVal_j
        end for
        mapScore_i ← sumSV2 / sumSV1
    end for
    for i = 1 to z do
        selectPos ← rand(1)
        if selectPos ≥ 0 and selectPos ≤ mapScore_1 then
            samp_data_i ← MD_1
        else
            for j = 2 to |scoreVal| do
                if selectPos > mapScore_{j−1} and selectPos ≤ mapScore_j then
                    samp_data_i ← MD_j
                end if
            end for
        end if
    end for

IV. EXPERIMENTS AND RESULTS

Table I compares IFGA-BoostMode-SVM, ORF [9], and CART [10] under 10-fold cross-validation on the thalassemia and Crohn's disease datasets. IFGA-BoostMode-SVM achieves higher accuracy and sensitivity than the standard ORF and CART methods on both datasets, although ORF and CART reach higher specificity on the Crohn's dataset. In Table I, no. feat., acc., sen., and spec. denote the number of features, accuracy, sensitivity, and specificity, respectively.

The thalassemia dataset (503 patients with 835 SNPs) was obtained from the Thalassemia Research Center, Mahidol University, and the Crohn's dataset (357 patients with 103 SNPs) was obtained from [11]. Missing data were inferred by the 2SNP phasing method [12]. For the SVM, a soft-margin RBF kernel function with its kernel parameter set to 0.5 was used to analyze both the Crohn's and thalassemia datasets. Dummy encoding is applied for the SVM: the vectors 100, 010, and 001 represent a major homozygote, minor homozygote, and heterozygote genotype, respectively.

In the IFGA, the chromosome size is varied from 1 to 10, so feature selection is performed for subsets of 1 to 10 features. The IFGA parameters were set as follows: the number of chromosomes is 1000; the cross-over rate is 0.7 for thalassemia and 0.8 for Crohn's disease; and the mutation rate is 0.035 for thalassemia and 0.001 for Crohn's disease.

TABLE I. THE EXPERIMENTAL RESULTS

    Dataset + Algorithm            no. feat.   acc. (%)   sen. (%)   spec. (%)
    Thal. + IFGA-BoostMode-SVM     6           71.57      76.39      64.14
    Thal. + ORF                    6           54.27      69.84      30.30
    Thal. + CART                   6           69.38      76.07      59.09
    Crohn + IFGA-BoostMode-SVM     8           71.06      64.58      74.90
    Crohn + ORF                    8           57.88      20.14      80.25
    Crohn + CART                   8           63.31      23.61      86.83

V. CONCLUSION

A new IFGA with BoostMode-SVM was proposed to identify susceptibility loci in case-control association studies. The IFGA technique encodes chromosomes as integer sequences of varying length. The proposed SOS technique over-samples the minority dataset using one of two scoring approaches (US and BS). The method is well suited to case-control association studies. Experimental results on two real datasets, Crohn's disease and thalassemia, show that feature selection and classification by the IFGA with BoostMode-SVM achieves higher accuracy than the standard ORF and CART techniques.

REFERENCES

[1] J. Marchini, P. Donnelly, and L. R. Cardon, "Genome-wide strategies for detecting multiple loci that influence complex diseases," Nature Genetics, vol. 37, pp. 413-417, March 2005.
[2] D. J. Weatherall, "Science, medicine, and the future: Single gene disorders or complex traits: Lessons from the thalassaemias and other monogenic diseases," BMJ, v
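
As an illustration of the integer encoding (Section II-A) and IFGA mutation (Section II-D), the following is a minimal Python sketch rather than the authors' implementation. The function names `encode_chromosome` and `mutate` are hypothetical, and sorting the selected positions is an assumption made only so that the output resembles the example $Q_i = (1, 5, 6)$.

```python
import random


def encode_chromosome(m, size, rng=random):
    """Integer encoding: choose `size` distinct feature positions from 1..m.

    Sorting is an illustrative assumption; the paper only says the
    positions are drawn at random.
    """
    if not 1 <= size <= m:
        raise ValueError("size must satisfy 1 <= size <= m")
    return sorted(rng.sample(range(1, m + 1), size))


def mutate(input_chrom, m, rng=random):
    """IFGA mutation: overwrite one randomly chosen locus with a random
    feature position in 1..m, leaving every other locus unchanged."""
    pos_out = rng.randrange(len(input_chrom))  # locus to overwrite (0-based)
    pos_in = rng.randint(1, m)                 # new feature position
    output_chrom = list(input_chrom)
    output_chrom[pos_out] = pos_in
    return output_chrom


if __name__ == "__main__":
    rng = random.Random(1)
    q = encode_chromosome(m=7, size=3, rng=rng)
    print(q, mutate(q, m=7, rng=rng))
```

As in the pseudocode, the mutation step does not check whether the new position already occurs on the chromosome; a real implementation would probably need to resample in that case.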
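
The selection step of Section II-B names stochastic universal sampling (SUS) but gives no details. The sketch below shows the textbook form of SUS in Python: one random start and `n` equally spaced pointers over the cumulative weights. Because the paper's fitness score is an error rate, the sketch assumes the caller first converts it to a selection weight (for example `1 - error`); that conversion and the name `sus_select` are assumptions, not something stated in the paper.

```python
import random


def sus_select(population, weights, n, rng=random):
    """Stochastic universal sampling: pick `n` individuals with
    probability proportional to non-negative `weights`, using a single
    random offset and n evenly spaced pointers."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    step = total / n
    start = rng.uniform(0, step)
    chosen = []
    i, cumulative = 0, weights[0]
    for k in range(n):
        pointer = start + k * step
        # Advance to the individual whose cumulative interval holds the pointer.
        while cumulative < pointer and i < len(weights) - 1:
            i += 1
            cumulative += weights[i]
        chosen.append(population[i])
    return chosen


if __name__ == "__main__":
    errors = [0.40, 0.30, 0.35]                # hypothetical fitness (error rates)
    weights = [1.0 - e for e in errors]        # assumed conversion to weights
    print(sus_select(["Q1", "Q2", "Q3"], weights, n=3, rng=random.Random(0)))
```

Compared with repeated roulette-wheel draws, the evenly spaced pointers keep the number of copies of each individual close to its expected value.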
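
The IFGA cross-over of Section II-C amounts to shuffling each parent and then exchanging the tails after a random cut point c. A minimal Python sketch under that reading follows. The name `crossover` is hypothetical, and clamping the cut point to 1 when the shorter parent has a single locus is an assumption, since the pseudocode leaves that case undefined.

```python
import random


def crossover(parent1, parent2, rng=random):
    """IFGA-style cross-over for parents of possibly different lengths.

    Returns (offspring1, offspring2), where offspring1 has the length of
    parent2 and offspring2 has the length of parent1.
    """
    if not parent1 or not parent2:
        raise ValueError("each parent needs at least one locus")
    if len(parent1) == 1 and len(parent2) == 1:
        raise ValueError("at least one parent needs more than one locus")
    # Random permutations of the parents (x and y in the pseudocode).
    x = rng.sample(parent1, len(parent1))
    y = rng.sample(parent2, len(parent2))
    # Cut point in 1 .. min(|parent1|, |parent2|) - 1, clamped to >= 1
    # (the clamp is an assumption for the single-locus case).
    upper = max(1, min(len(x), len(y)) - 1)
    c = rng.randint(1, upper)
    offspring1 = x[:c] + y[c:]
    offspring2 = y[:c] + x[c:]
    return offspring1, offspring2


if __name__ == "__main__":
    print(crossover([1, 5, 6], [2, 4, 7, 9], random.Random(3)))
```

The sketch does not repair an offspring that inherits the same feature from both parents; the paper requires features to be unique within a chromosome but does not show how duplicates are handled.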
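
The BoostMode computation of Section III-B can be sketched as follows, assuming that `mode` is taken feature by feature over nominal genotype vectors (the paper does not spell this out). The helper names `column_mode` and `boost_mode` are hypothetical, and ties are broken by whichever value `Counter` encounters first.

```python
import random
from collections import Counter


def column_mode(rows):
    """Per-feature mode of a list of equal-length nominal vectors."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def boost_mode(minority, t, rng=random):
    """Bootstrap the minority group t times and return the mode of the
    per-bootstrap mode vectors."""
    n_minor = len(minority)
    all_modes = []
    for _ in range(t):
        # Sampling with replacement; each group has n_minor data points.
        group = [rng.choice(minority) for _ in range(n_minor)]
        all_modes.append(column_mode(group))
    return column_mode(all_modes)


if __name__ == "__main__":
    minority = [(0, 1, 2), (0, 1, 1), (0, 2, 2), (1, 1, 2)]
    print(boost_mode(minority, t=25, rng=random.Random(0)))
```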
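
The scoring and over-sampling steps of Sections III-C to III-E can be sketched in Python as below. This is an illustration under stated assumptions: the function names are hypothetical, the control-group adjustment of the BS algorithm is omitted because its operator is not legible in the source, and the inner search loop of the SOS pseudocode is replaced by an equivalent binary search over the cumulative scores.

```python
import bisect
import random
from itertools import accumulate


def unbiased_scores(n_minor):
    """US method: every minority point gets the same score 1/n_minor."""
    return [1.0 / n_minor] * n_minor


def bias_scores(distances):
    """BS method (without the control-group step): shift the distances so
    that all are >= 1, then return the normalised cumulative sums."""
    add_val = abs(min(distances)) + 1
    tmp = [d + add_val for d in distances]
    total = sum(tmp)
    return [c / total for c in accumulate(tmp)]


def scored_oversample(minority, score_val, n_major, rng=random):
    """SOS: draw n_major - len(minority) extra points from the minority
    group, using the normalised cumulative scores as sampling intervals."""
    z = n_major - len(minority)
    total = sum(score_val)
    map_score = [c / total for c in accumulate(score_val)]
    samples = []
    for _ in range(z):
        select_pos = rng.random()
        # Smallest j with select_pos <= map_score[j], as in the pseudocode.
        j = bisect.bisect_left(map_score, select_pos)
        samples.append(minority[min(j, len(minority) - 1)])
    return samples


if __name__ == "__main__":
    minority = ["MD1", "MD2", "MD3"]
    extra = scored_oversample(minority, unbiased_scores(3), n_major=7,
                              rng=random.Random(0))
    print(extra)
```

After the call, the minority group plus the returned samples has as many points as the majority group, which is the balance condition SVM2 is trained under.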