




Semi-structured Text Mining Methods
Yang Jian…, Institute of Computer Science and Technology, Peking University

Text-centric XML retrieval
- Documents marked up as XML
  - E.g., assembly manuals, journal …
- Queries are user information needs
  - E.g., give me the Section (element) of the document that tells me how to change a brake light
- Different from well-structured XML queries where you tightly specify what you're looking for

Vector spaces and XML
- Vector spaces: tried + tested framework for keyword retrieval
  - Other "bag of words" applications in classification, clustering …
- For text-centric XML retrieval, can we make use of vector space ideas?
- Challenge: capture the structure of an XML document in the vector space.

Vector spaces and XML
- For instance, distinguish between the following two cases
[Figure: two document trees; surviving labels: "The Pearly", "Microsoft", "Bill", "Bill"]

Content-rich XML: …
[Figure: document trees whose leaves are lexicon terms; surviving labels: "Microsoft", "The", "Lexicon"]

Encoding the Gates …
- What are the axes of the vector space?
- In text retrieval, there would be a single axis for Gates
- Here we must separate out the two occurrences, under Author and Title
- Thus, axes must represent not only terms, but something about their position in an XML tree

Queries
- Before addressing this, let us consider the kinds of queries we want to handle
[Figure: query trees; surviving label: "Microsoft"]

Query types
- The preceding examples can be viewed as subtrees of the document
- But what about a query such as (Gates somewhere underneath …)?
- This is harder and we will return to it later

Subtrees and structure
- Consider all subtrees of the document that include at least one lexicon term
[Figure: enumerated subtrees; surviving label: "Microsoft"]

Structural terms
- Call each of the resulting (8+, in previous slide) subtrees a structural term
- Note that structural terms might occur multiple times in a document
- Create one axis in the vector space for each distinct structural term
- Weights based on frequencies for number of occurrences (just as we had tf)
- All the usual issues with terms (stemming? case folding?) remain

Example of tf weighting
[Figure: Play document tree; surviving text: "To be or to"]
- Exercise: How many axes are there in this example?
- Here the structural terms containing "to" or "be" would have more weight than those that don't

Down-weighting
- For the doc on the left: in a structural term rooted at the node Play, shouldn't Hamlet have a higher tf weight than Yorick?
- Idea: multiply the tf contribution of a term to a node k levels up by $\gamma^k$, for some $\gamma < 1$

Down-weighting example, $\gamma = 0.8$
- For the doc on the previous slide, the tf of Hamlet is multiplied by …, and Yorick is multiplied by …, in any structural term rooted at Play

The number of structural terms
- Can be huge!
- Alright, how huge, really?
- Impractical to build a vector index with so many dimensions
- Will examine pragmatic solutions to this shortly; for now, continue to believe …

Structural terms: docs + queries
- The notion of structural terms is independent of any schema/DTD for the XML documents
- Well-suited to a heterogeneous collection of XML documents
- Each document becomes a vector in the space of structural terms
- A query tree can likewise be factored into structural terms
  - And represented as a vector
  - Allows weighting portions of the query

Example …
[Figure: example query with the weights 0.6 and 0.4 assigned to its subtrees]

Weight variations
- The assignment of the weights 0.6 and 0.4 in the previous example to subtrees was …
  - Can be more …
  - Think of it as generated by an application, not necessarily an end-user
- Queries, documents become normalized vectors
- Retrieval score computation is "just" a matter of cosine similarity computation

Restrict structural terms?
- Depending on the application, we may restrict the structural terms
- E.g., may never want to return a Title node, only Book or Play nodes
- So don't enumerate/index/retrieve/score structural terms rooted at some nodes

The catch remains
- This is all very promising, but …
- How big is this vector space?
- Can be exponentially large in the size of the document
- Cannot hope to build such an index
- And in any case, still fails to answer queries like …

Two solutions
- Query-time materialization of subtrees
- Restrict the kinds of subtrees to a manageable set

Query-time materialization
- Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query
  - The latter is hopefully a small set
- Now, we're reduced to checking which structural term(s) from the query match a subtree of any document
- This is tree pattern matching: given a text tree and a pattern tree, find matches
  - Except we have many text trees
  - Our trees are labeled and …

Example
- Here we seek a doc with Hamlet in the title
- On finding the match we compute the cosine similarity score
- After all matches are found, rank by sorting
[Figure: text tree and query tree; surviving labels: "Text", "Query", "Hamlet"]

(Still …)
- A doc with Yorick somewhere in it
[Figure: query tree]
- Will get to it

Restricting the subtrees
- Enumerating all structural terms (subtrees) is prohibitive, for indexing
  - Most subtrees may never be used in processing any query
- Can we get away with indexing a restricted class of subtrees?
  - Ideally: focus on subtrees likely to arise in queries

JuruXML (IBM Haifa)
- Only paths including a lexicon term
- In this example there are only 14 (why?) such paths
- Thus we have 14 structural terms in the index
[Figure: Play document tree; surviving labels: "Hamlet", "To be or to"]
- Why is this far more manageable?
- How big can the index be as a function of the …?

Variations
- Could have used other subtrees, e.g., all subtrees with two siblings under a node
- Which subtrees get used: depends on the likely queries in the application
- Could be specified at index time: area with little research so far
[Figure: subtree with two siblings; surviving labels: "Microsoft", "2"]

Variations
- Why would this be any different from just paths?
- Because we preserve more of the structure that a query may …
[Figure: surviving label: "Microsoft"]

Return to the descendant examples
- No known …
- Query seeks Gates under …

… in the vector space
- Devise a match function that yields a score in [0,1] between structural terms
- E.g., when the structural terms are paths, measure overlap
- The greater the overlap, the higher the match score
  - Can adjust match for where the overlap occurs

How do we use this in retrieval?
- First enumerate structural terms in the query
- Measure each for match against the dictionary of structural terms
  - Just like a postings lookup, except not Boolean (does the term exist)
  - Instead, produce a score that says "80% close to this structural term", etc.
- Then, retrieve docs with that structural term, compute cosine similarities, etc.

Example of a retrieval step
[Figure: match scores between query and index structural terms; ST = Structural Term]
- Now rank the Docs by cosine similarity; e.g., Doc9 scores 0.578.

Closing technicalities
- But what exactly is a Doc?
- In a sense, an entire corpus can be viewed as an XML document

What are the Docs in the index?
- Anything we are prepared to return as an answer
- Could be nodes, some of their children …

What are queries we can't handle using vector spaces?
- Find figures that describe the CORBA architecture and the paragraphs that refer to those figures
  - Requires JOIN between 2 tables
- Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro
  - Depends on order of sibling nodes

Can we do IDF?
- Yes, but it doesn't make sense to do it corpus-wide
- Can do it, for instance, within all text under a certain element name, say Chapter
- Yields a tf-idf weight for each lexicon term under an element
- Issues: how do we propagate contributions to higher level nodes?

Example
- Say Gates has high IDF under the Author element
- How should it be tf-idf weighted for the Book element?
- Should we use the idf for Gates in Author or that in Book?

XQuery: SQL for XML
- Usage …
  - Human-readable documents
  - Data-oriented documents
  - Mixed documents (e.g., patient records)
- Relies on …
  - XML Schema …
- Turing complete
- XQuery is still a working draft

XQuery
- The principal forms of XQuery expressions are:
  - path expressions
  - element constructors
  - FLWR ("flower") expressions
  - list expressions
  - datatype expressions
- Evaluated with respect to a context

FLWR

    FOR $p IN document("bib.xml")//publisher
    LET $b := document("bib.xml")//book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p

- FOR generates an ordered list of bindings of publisher names to $p
- LET associates to each binding a further binding of the list of book elements with that publisher to $b
- At this stage, we have an ordered list of tuples of bindings: ($p, $b)
- WHERE filters that list to retain only the desired tuples
- RETURN constructs for each tuple a resulting value

Queries supported by XQuery
- Location/position ("chapter …")
- Simple …: /play/title contains …
- Path …: title contains …; /play//title contains …
- Complex …: Employees with two …
- Subsumes: …
- What about relevance …?

How XQuery makes …
- All documents in set A must be ranked above all documents in set B.
- Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By clause

    for $d in …
    let $e := document("emps.xml")//emp[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return <big-dept> { $d, …

XQuery: Order By clause
- Order by clause only allows ordering by …
  - Say by an attribute value
- Relevance ranking
  - Is often …
  - Can't be expressed easily as a function of the set to be ranked
  - Is better abstracted out of query formulation (cf. …)

University of …
- Goal: open source XML search engine
- "Returnable" fragments are …
  - E.g., don't return a <bold>some text</bold> fragment
- Structured Document Retrieval Principle
- Empower users who don't know the schema
  - Enable search for any person no matter how the schema encodes the data
  - Don't worry about …

Atomic units
- Specified in …
- Only atomic units can be returned as result of search (unless unit specified)
- Tf.idf weighting is applied to atomic units
- Probabilistic combination of "evidence" from atomic units

XIRQL …
- A system should always retrieve the most specific part of a document answering a query.
- Example query: …

    <chapter> 0.3 … <section> 0.8 XQL 0.7 syntax …

- Return section, not chapter

Augmentation …
- Ensure that the Structured Document Retrieval Principle is respected.
- Assume different query conditions are disjoint events -> independence.

$$P(\text{chapter}, \text{XQL}) = P(\text{XQL} \mid \text{chapter}) + P(\text{section} \mid \text{chapter}) \cdot P(\text{XQL} \mid \text{section}) - P(\text{XQL} \mid \text{chapter}) \cdot P(\text{section} \mid \text{chapter}) \cdot P(\text{XQL} \mid \text{section})$$
$$= 0.3 + 0.6 \cdot 0.8 - 0.3 \cdot 0.6 \cdot 0.8 = 0.636$$

- Section (0.8) ranked ahead of chapter (0.636)

Example: …
- Assign all elements and attributes with person semantics to this datatype
- Allow user to search for … without specifying …

XIRQL: …
- Relevance …
- Fragment/context …
- Datatypes …
- Semantic …

XML …

Native XML …
- Uses XML document as logical unit
- Should …
  - PCDATA (parsed character data)
  - Document …
- Contrast with
  - DB modified for XML
  - Generic IR system modified for XML

XML Indexing and …
- Most native XML databases have taken a DB approach
- No IR type relevance ranking
- Only a few that focus on relevance ranking

Data vs. Text-centric XML
- Data-centric XML: used for messaging between enterprise applications
  - Mainly a recasting of relational data
- Content-centric XML: used for annotating content
  - Rich in text
  - Demands good integration of text retrieval functionality
  - E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

Data structures for XML retrieval
- A very basic …

Data structures for XML retrieval
- What are the primitives we need?
- Inverted index: give me all elements matching text query Q
  - We know how to do this: treat each element as a document
- Give me all elements (immediately) below any instance of the Book element
- Combination of the above

Parent/child links
- Number each element
- Maintain a list of parent-child relationships
  - E.g., Chapter:21 …
- Enables immediate parent
- But what about "the word Hamlet under a Scene element under a Play element"?

General positional indexes
- View the XML document as a text document
- Build a positional index for each element
  - Mark the beginning and end for each element, …

Positional containment
[Figure: positional postings showing "droppeth" under Verse under Play]

Summary of data structures
- Path containment etc. can essentially be solved by positional inverted indexes
- Retrieval consists of "merging" postings
- All the compression tricks etc. from 276A are still applicable
- Complications arise from insertion/deletion of elements, text within elements
  - Beyond the scope of this course

INEX: a benchmark for text-centric XML retrieval

INEX
- Benchmark for the evaluation of XML retrieval
  - Analog of TREC (recall …)
- Consists of:
  - Set of XML documents
  - Collection of retrieval tasks

INEX
- Each engine indexes docs
- Engine team converts retrieval tasks into queries
  - In XML query language understood by the engine
- In response, the engine retrieves not docs, but elements within docs
  - Engine ranks retrieved elements

INEX assessment
- For each query, each retrieved element is human-assessed on two measures:
  - Relevance: how relevant is the retrieved element
  - Coverage: is the retrieved element too specific, too general, or just right
    - E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general) or the definition itself
- These assessments are turned into composite precision/recall measures

INEX corpus
- 12,107 articles from IEEE Computer Society …
- 494 …
- Average article: 1,532 XML nodes
- Average node depth = …

INEX topics
- Each topic is an information need, one of two kinds:
  - Content Only (CO): free text queries
  - Content and Structure (CAS): structural constraints, e.g., containment conditions

Sample INEX CO topic

    <Title> computational biology
    <Keywords> computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding
    <Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational biology
    <Narrative> To be relevant, a document/component must either …
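The slides on structural terms describe restricting the vector space to paths that end in a lexicon term, then scoring by cosine similarity. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the implementation of JuruXML or any other system: the helper names `path_terms` and `cosine`, the toy documents (including the invented author name "Smith"), and the choice to use raw counts without stemming, stop words, idf, or down-weighting are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_terms(xml_text):
    """Count path-shaped structural terms in one XML document.

    Assumption for this sketch: a structural term is a downward path that
    starts at some ancestor element and ends in a word of element text,
    e.g. Book/Author/gates and Author/gates. Tail text is ignored.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, ancestors):
        tags = ancestors + [node.tag]
        for word in (node.text or "").lower().split():
            # Emit one term per possible starting element on the path.
            for start in range(len(tags)):
                counts["/".join(tags[start:] + [word])] += 1
        for child in node:
            walk(child, tags)

    walk(root, [])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data (hypothetical): "Gates" appears under Author in doc1
# and under Title in doc2.
doc1 = "<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>"
doc2 = "<Book><Title>The Pearly Gates</Title><Author>Bill Smith</Author></Book>"
query = "<Book><Author>Gates</Author></Book>"

query_vec = path_terms(query)
score1 = cosine(query_vec, path_terms(doc1))
score2 = cosine(query_vec, path_terms(doc2))
print(score1, score2)
```

In this sketch the query factors into two structural terms, `Book/Author/gates` and `Author/gates`. Only the first document contains them, so it gets a positive score while the second scores zero even though it contains the word "Gates", which is the distinction a plain bag-of-words vector would miss.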