[Worksheets: Rich's Data (FYI) · RawRemedyData · CookedRemedyData · 4.x All · 4.x Esc · 5.0 All · 5.0 Esc · 5.5 All · 5.5 Esc · 5.6 All · 5.6 Esc · Bubble-All · Bubble-Esc · BubbleGraph · GraphData(Qtrs) · By Release · Esc By Release · Ingres v. Oracle · Esc Ingres v. Oracle]

DA failed because indexes weren't there. Unclear if they were dropped or never created. We provided a script to (re)create them.
Could not follow up on root cause. PROCESS - UNKNOWN?
DISK LAYOUT: Reports taking a long time. Customer was using software RAID 5 via Solaris DiskSuite, which is a known no-no. Customer changed this and moved redo logs to reduce disk contention. We also pointed them to a doc on how to lay out disk drives efficiently for Oracle.
LOAD BALANCING: Slow AAG reports. Support went on site and changed the load balancing of scheduled reports, which corrected the problem.
Poor TopN report performance. It looks like the customer's disk drives are not laid out optimally (eHealth and Oracle on the same disk). Ticket in MoreInfo.
SIZING - KB: Migrating 5.0 to 5.6.5. The sizing spreadsheet said they needed 48 GB, but nhComputeDiskSpace is saying 400 GB. 1,500 elements, 52 weeks of raw data. Support helped them get their DB loaded (127 GB). They could not provide the sizing files for review.
SIZING: Three 5.5 customers getting "Unable to connect to DB" in nightly jobs. Root cause?
Customer upgraded cluster to 5.6.5. On one cluster member, replication failed because it ran out of disk space. We were unable to pursue the root cause. To resolve the problem, ... ???
ProServ created an LCF file for a 2MM 5.0.2-to-5.6.5 migration and wanted some info on it. We worked with them and determined that it is safer and more accurate to have the installer create the LCF file rather than doing it manually. They did this successfully.
ORACLE PARMS: Oracle crashed; it ran out of shared pool space due to fragmentation. DB corrupted; customer had to reload the DB from backup. We updated the Oracle shared pool size in 5.6.5 P2, which has addressed the issue.
CODE - TUNING / ORACLE ERROR - KB: Save failed in eHealth 5.5. Known problem in Oracle 8.1.7; customers must run nhReset weekly to address the issue. Fixed in eHealth 5.6 (Oracle 9).
Repeat of 37077 above: Oracle crash due to shared pool fragmentation. Addressed in 5.6.5 P2.
Oracle shared pool space problems; increased cursors and shared memory parameters.
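For reference, shared-pool tuning of the kind described in these cases is done through the Oracle initialization parameters. The fragment below is an illustrative sketch only; the values are hypothetical and are not the settings shipped in 5.6.5 P2:

```
# init.ora excerpt -- illustrative values, not the shipped defaults
shared_pool_size          = 150000000   # bytes of shared pool
shared_pool_reserved_size = 15000000    # reserve ~10% for large allocations
open_cursors              = 1024        # max open cursors per session
```

Raising shared_pool_reserved_size along with shared_pool_size is what mitigates fragmentation: it sets aside a contiguous region for large allocations so they do not fail when the main pool becomes fragmented.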
How is this different from 37077? It has a different fix...
DBAPI on systems with more than 30 days of raw data causes shared pool memory errors. Design issue in DBAPI. We changed tuning parameters to get to 30-day support with DBAPI. We cannot support more than 30 days raw. A long-term (significant development) effort has been identified.
Same issue as 37798: shared pool space issue with DBAPI. Changing pool parameters addressed the problem.
TUNING / NOT ENOUGH INFO: Error in DB conversion in upgrade from 5.5 to 5.6.5. Error initializing standalone poller. It looks like an eHealth process had a table locked that the SA poller needed...? Customer abandoned the effort and went back to 5.5.
UNKNOWN / TBD: nhManageDbSpace could not move a DB file that was > 2 GB, because the eHealth framework does not support such files. In progress.
Migration 5.0.2 to 5.6.5: rpt files not migrated properly. Install team has it now.
5.6.5 install failed. Oracle installation failed. Had customer clean the system and start again. Success. No info on why the original install failed.
1MM migration to 5.6.5 failed. Oracle patch install failed. Could be a leftover shared memory segment from a previous failed attempt. Customer got beyond this issue and then ran into 38599 below.
Same situation as 38509: createDb failed. Customer ended up installing on a different system, which went fine.
5.6.5 ASCII load during move from Windows to Solaris failed. Upgraded the Solaris system to P2 and the load worked.
Install of 5.6.5 P2 failed on convert; convert state was wrong. Reset convert state and re-applied the patch: success.
nhExportData is slower in 5.6.5 than 4.8. MoreInfo.
DB save fails due to lack of disk space, although the save should not require that much space. MoreInfo.

Prob-Id | Submitter | Create-date | Impacted Customer | Status | Severity | Short Description | Root Causes
snorman | R-4.6.0 | DB Create | Paine Weber | Fixed | Critical | Database writes for TA are taking too long
schapman | DB Load | Belgacom | NoDupl | Scheduled save overwrites command line load
rkeville | R-4.6 P2 | DB Roll-up | Allied Riser Operations | Closed | Database has duplicates after running cleanStats script; indexing fails.
foconnor | R-4.7.1 | Mobilix | Unable to enable jobs via the command line | High
Conversation rollups running for very long periods
R-4.7.1P1 | AIG | Dr. Watson error: NhiDialogRollup.exe Exception: stack overflow (0xc00000fd), Address 0x704ff03a
DB Recover | Gov of British Columbia | Unresolved deadlock causing nethealth database to be marked inconsistent.
R-4.6 P3 | DB Verify | Southwestern Bell Internet | Repeated QEF errors in errlog.log file; rollups failing constantly.
yzhang | Axilanz | Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
DB Status | Alcanet International | Disk file read error on database: nethealth
jpoblete | DB Server | CompuCom Systems, Inc. | NoBug | Medium | nhDbServer dies after discovering elements
R-4.6 P5 | MCIWC-WMS Managed | nhiPoller hangs
R-4.5.1 | NPS is the reseller; American Home Products is the customer | Dialog rollups failing: Error: Append to table nh_dlg1b_966311999 failed
tcordes | DBS | Data rollup causes data to be reduced by a factor of 10
wburke | Price Waterhouse Coopers LLP | iimerge running for over 24 hours @ 24% CPU.
R-4.7.1P2 | KPN | Deadlock errors lead to database inconsistency
R-4.7.2 | DB Save | Getronics | Unrecoverable DMT show error.
R-4.8.0P4 | Johnson & Johnson + Red Cross | Conversation rollup hangs.
tstachowicz | Dimension Data (DDMS) | nh_daily_exceptions error for database save and data analysis
R-4.6 P6 | Siemens Telecom (form. Siemens Nixdorf) | Database save, data analysis and reports failing with nh_daily_health table and page -1 error
R-4.8.0 | Computer Associates | Dr. Watson errors on dialog rollups.
Sprint (Reston) | Stats rollup failing; dups on nh_stats0
rrick | DB Backup | Emory University | nhLoadDb error
R-4.8.0P2 | Sprint PCS-IT | Ingres two-GB limit on database: nethealth, table: nh_element, pathname: /nh04/idb/ingres/data/default/nethealth, filename: aaaaaaal
Union Tribune Publishing Co. | nhSaveDb fails silently.
Computer Sciences Corporation | Server stops during DCI merge
Jet Propulsion Laboratory | nh_element table hit 2.15 GB.
cpaschal | Wyeth Ayerst (aka American) | Database load succeeded with errors: Non-Fatal database error on object: NH_DAILY_SYMBOL, E_QE0083 Error modifying a table
jnormandin | MetLife | Database status hangs and eventually returns bogus data
MCI | Customer wants to fill DB gap with data from other DB save.
R-4.7.1P3 | Merck & Co., Inc. | Ingres keeps crashing
American Skandia | Problem: Database crashes every few days
OneSystem Group/Farmland | nhSaveDb using ascii format stops with an error
R-4.8.0P3 | Nethealth servers stopped due to Ingres stack dump
ComTech | Data analysis failing
British Telecom Barclays | Need procedure to change element_id on remotes to match the central server.
cestep | TeleGlobe | Server stopping unexpectedly
United Technologies Corp. | Database save and data analysis failing
DB Ingres | Eastman Kodak | nh_element table has reached 2-gig limit
R-4.7.2 P1 | Qwest Communications - MDS | nhSaveDb hangs.
US Department of Justice/JMD Division | Ingres will not come up.
shagar | Fortis Bank (form. ASLK) | Getting Dr. Watson error due to stack overflow
Federal Communications | Nethealth fails to start. DAC tables seem to be corrupt
klandry | JP Morgan Chase Bank | Seeing r-norm for newly discovered elements.
DB Other | Instinet | Database marked inconsistent after customer runs nhDbStatus from GUI.
Ingres stack dumps
R-4.8.0P6 | Swisscom AG | Statistic rollups take 3 days to finish
Duplicates created in the database on Central Server after nhReset -db is performed.
tfuller | Ralston Purina | Ingres database memory errors
Fetches take 22 minutes | beta program
B1-5.5 | DB Oracle | PWC GLOBAL | PWC: Database Status window does not reflect the database disk free space
bmiller | R-5.0.0 | TECO | RLE Browser does not display technologies or groups under 'All Technologies' screen
Betaprogram | PWC: Scheduled maintenance failure
R-5.0.2 | Lock quota exceeded at 1000.
Bear Stearns | nhiDialogRollups fail on dlg1 table.
PWC: Conversation rollup failure every 4 hours.
Equant | Deadlocks on nethealth database (SPVD).
Vanguard Group | Deadlock errors on the DB.
Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
B2-5.5 | PWC: Upgrading 55B1 -> 55B2, error msg when trying to run the INSTALL.NH script for beta2 install; resolved by adding an ingres user
EQUANT | EQUANT: The elements are being added, but only one group is being created (cfcnet) with one device in it (a038-RH).
ALCATEL USA | Live Exceptions History job is creating errors in Ingres error log
egirardin | 5.5-P4 | Countrywide Home Loans | Sql Error occurred during operation nhQuery5: (ORA-01410: invalid ROWID) for Fast-Live Poller
R-5.0.2P6 | Statistic Index job failing
don | Disk configuration recommendations
Alltel | Alltel's REDO LOGS too small; delete archive logs not working
Defense Contract Management | nhiLoadDb fails
CSELT | Rollup failing; system down
MCI - Managed Services | Forensics when enabling NH_MANAGE_STAT_DUPS are causing problems and calls to support.
SQL errors prevent polling.
British Telecom Exact | From which DB table can customer get the relationship between group_type and group_name?
5.5-P5 | Alltel Communications Inc. | Data analysis is taking too long, which causes other jobs to be missed and the server to reset | DESIGN
dagray | Dimension Data | DB load of a DB saved with 5.0.2 P6/D6 to a new 5.0.2 P6/D6 machine does not copy web/webCfg/users.cfg and passwd files
ahoward | MacQuarie Corporate | Negative values in eHealth DB
Verizon-Bedminster, and more | Tech support needs a tablespace collapser
Database save fails intermittently
Siemens Westinghouse | Permissions issue with Oracle installation.
eHealth spontaneously restarting
Receiving nhiLiveExSvr: Sql Error in the system.log file every 5 minutes
IRS | nhiIndexStats is getting overwhelmed and the nhServer crashes
VR Kreditwerk | Is there a large difference between P02 and P06?
nhReset does not restart Oracle
Dimension Data (Netherlands) | Error converting device_speed during 4.8 to 5.0.2 database load
Verizon Kaiser | Unable to create new group from the imported elements after running nhiImportGroup (split/merge)
Siemens Medical (Health Services) | AAG report failed with Ingres error
Unable to start Oracle after system crash
Fleet | Statistics rollups failing due to duplicate AR data
Unisys Corp | Why Oracle was running out of shared memory
Verizon Wireless | With the NH_MANAGE_STAT_DUPS functionality, customer is experiencing server hangs due to the nhiIndexStats scheduled job
ITXC | Does Sol 2.8 to Sol 2.9 require a relink of Oracle exes? Queue consulting
Data from one remote is not getting inserted into the database
Nortel (POC), Dave Tin (CCRD) | Error launching the Oracle 9.2.0.3.0 Patch Set install program
Verizon-Bedminster | Tablespace fragmentation issue
Qwest | Looks like the stats poller gets stuck during the stats rollup
TGS Gmbh, Cendant Corp | 5.6.1 Beta 6 upgrade having problems with their 5.5 Beta DbSave; save fails
MCI-Nasdaq | iimerge is at 40% - 50%.
Siemens | nhiCfgServer exits
Seemingly benign errors in the 5.0.2 Cert D06
ICS GMBH | Certification D06 removes all entries in $NH_HOME/db/data/elementTypeVariable.usr
SQL error regarding nh_stats_poll_info table after a discover
aatkins | nhiCheckOracle errors after 5.5 -> 5.6 upgrade and reboot
R-5.0.2P8 | ING Nederland ITC | Data Analysis getting error messages: Unexpected database error
R-5.6 | nhNameNode failed with sql error
Qwest Communications - IP Engineering | iirtemp corruption in Ingres
Telefonica | eHealth server is hanging once every two to three days
Low | Corrupted missed_polls data in STATS0 tables
R-5.0.2P7 | Aliant Telecom Inc. | Fetch fails with sqlca.sqlcode: -49900
Fidelity Investment Services | iidbms is at 100%.
Syncrude | Db error when renaming group.
Verizon - Kaiser | Problem with NH_MANAGE_DUP_STATS with 502 patch07 | CODING_ERROR
Unable to start Oracle due to a disk block corruption
R-5.5 R-5.6 R-5.6.1 5.6.5M1 6.0M1 (HP custs) | Bug in nhCreateDb on HP when getting physical memory
R-5.5 R-5.6 R-5.6.1 | All 5.5 custs | Long saves in 5.[56] will be CORRUPTED by log scrubber
Washington Mutual | Problem with migrating scheduled jobs.
Edge On/Belgacom | Install fails
nhSaveDb fails
Alltel Communications, Inc. | Discover to group: group creation and population fails after group was deleted and fsa scrubber is run.
Getronics Wang | Statistics rollups failing
FactSet Research Systems | Backfill job appears to be hung
International Truck and Engine | nhLoadDb failed during migration from eHealth 5.0.2 to eHealth 5.6.0.
R-5.6.5 | NOAA / US Department of Commerce | Problem with third Oracle CD.
Equant France | ORA-01403: no data found
Does nhiPoller use the NH_MANAGE_DUP_STATS variable?
Error starting eHealth.
Delete Archive Log job is not deleting archive logs
Stack dump messages occurring when Ingres crashes
SAP AG | Conversation rollup failure with transaction aborted
Compaq Computer Corporation | The report failed with Ingres error
Veitsch-Radex | Live Exceptions History job fails with Db error.
Computer Science Corporation North America | nhiClearEventHistory failed with query aborted
Schering AG | PDF files are not generated, with the following error messages: "Too few operands" / "An unrecognized token '-NaN' was found"
5.6.5B1 | Amasol AG | PR: Can't assign standard variables to new element types
R-5.0.2 R-5.0.2P7 | The Data Analysis job hangs after loading a new db.
5.6.1 beta | Huntington National Bank | Rollups failing with ORA-00001 error: ORA-00001: unique constraint (EHEALTH.NH_STATS0_1065095999_IX1) violated
B1-5.6 | Oracle dbsave failed with internal error
Navigation Technologies | Customer received pop-up error during Vision install portion of 1MM 5.0.2 - 5.6.0a on W2K
Customer received error during convertDb portion of 1MM 5.0.2 - 5.6.0a on W2K
Motorola Israel Information Systems, Ltd. | Oracle will not start up after an apparently successful installation.
DDGSOA Dimension Data | Poller.cfg is not imported correctly at end of nhImportGroup command.
After split/merge, when adding a new group, the console window shows the group was added; reopen the group UI and the group is not present
AT&T Wireless Services | Unable to online tablespaces NH_DATA01 and NH_DATA02
Calence | Problem with data analysis
Fleet Boston Financial | Statistics rollups failing
Sprint PCS-OSSN | How, and how often, are the Oracle tables and indexes analyzed?
Union Bank of California | nhComputeDbSpace library errors.
AR data created FUTURE STATS0 table
Scheduled database load job doesn't work.
Verizon Wireless-Bedminster | Security alert patches 40, 42, 48-51, 51 for Oracle
State of Nevada | Upgrade from 5.0.2 to 5.6 receives an error.
Aon | Customer is getting duplicates after applying 5.0.2 P07.
Stats rollup failure with recursive sql error
Vodacom | "converting column overflows integer datatype" error from LiveEx Server after 5.6.5 2MM
areagan | 5.0.2 Cert D07
Computer Science Corporation - EMEA | Servers crash with discovery of elements with a ' in the uniqueDeviceId
Regal CineMedia | Upgrade from eHealth 55 to 56 failed on nhConvertDb
T-Systems Bilisim | Stats rollup failure.
MBNA Corporation | nhCreateDb in 5.6.5 fails, perhaps because it believes it is doing a migration any time it is run
Genesis Communication | No Application Response paths appear in web console
Goodyear Tire & Rubber Co. | Sizing spreadsheet with no Excel.
BT Ignite C&MMS | Errors when running nhManageDbSpace -createLcf.
Wan Technologies | nhLoadDb fails on eHealth migration from 5.0.2
Concord - Engineering | nhUpgradeOracleDb fails in a 5.5 to 5.6.5 upgrade.
T-System Dusseldorf Project BarmerNet | 5.6.0a to 5.6.5 upgrade fails on convertDb at nhiDbTasks -doSavePriorVersion
New York State Workers Compensation Board | Oracle database save failing
Stats index and stats rollup failed with unusual error
Moving /opt/eHealth 5.5 before upgrading to 5.6.5 causes upgrade to fail
R-5.6 R-5.6.5 | TGS Telonic Gmbh | Sql error when running nhNameNodes against a view.
Oracle Listener startup must be added to failover with High Availability | REQUIREMENTS
5.5-P6 | nhiDataAnalysis failing: Append to table nh_elem_outage failed (ORA-00001: unique constraint (EHE00P1.NH_ELEM_OUTAGE_PK)
BT Ignite | Cluster migration has failed on DB conversion
Delete database archive log job failing with error message: "quitting because V$SESSION_LONGOPS indicates a SAVE running"
Cendant Corporation | "Although it appears that there is enough disk space..." message... yet we still fail the customer's install!
Database save failing with too many corrupted blocks
ehealth502 installation hangs right on upgrading Ingres patch to 6793
DeTeCSM, Bielefeld | System log contains only 1 line before creating backup file
nhConvertDb fails during save-prior-version
ehealth565 install failed on installing Oracle patch
AT&T Solutions, Raleigh, NC / Concord ProServ | Ownership of Oracle alert log is changing during weekly maintenance job
Rabobank | "ORA-00054: resource busy and acquire with NOWAIT specified" error in nhDbMaint
Fort Huachuca - TNOSC | One element is showing raw data that should have been rolled up.
nhiClearEventHistory causes Ingres errors in errlog.log
Unable to ASCII save the database
nhStopDb hangs
nhUpgradeOracleDb fails
Advanced Network Products | A Trend report for one element showing BW in and out does not run for any time span over 24 hours: duplicate keys found
mcoates | Eds Australia Pty Ltd | Stats rollup failure
gtarpy | nhExportData takes more time than with eHealth 4.8 (Ingres).
Install hung on nhCreateDb
Unique constraint problem with nhiPoller
Allstream Inc. | Scheduled database save fails with invalid "insufficient space on location" message
Poor TopN report performance | Concord CoE
Custom AR profile rules for LE are dropped from db on upgrade
Custom Health report's baseline value changed from 4 weeks to 6 weeks after migration from 5.0.2 to 5.6.5
WesCorp Federal Credit Union | Second console user cannot run reports due to db error
Siemens AG | Availability and reachability from DbApi
Minnesota Department of Revenue (DOR) | Assigned | Unable to edit MyHealth reports
TNS listener security issues.
Receiving "unable to evaluate" messages from nhDbStatus in 'free space on device' column.
Borders Group Inc. | Customer Verified | 5.5 -> 5.6.5 upgrade fails with Oracle LCF-00011 error.
POC for Minn Dept. of Rev | Cannot add custom variables due to nhConvertDb failure | CODE - NON-DB
nhConvertDb fails after installation of one-off.
eHealth 5.6.5 fresh install on Win2000 (SP4) failing with odd errors
Canadian Waste Services Inc. | Unable to createDb on 5.6.5, W2K
Data analysis failure on appending table nh_elem_outage
Yamanouchi Consumer Inc. | Collection time for 'xxxxxxxxxxxxxxxxxxxxx' overlaps the data from a previous poll.
LiveHealth failing with unique constraint errors.
Naval Oceanographic Office Major Shared Resource Center | New install of 5.6.5, third CD (Oracle error invoking target install of makefile ora/rdbms/lib/ins_rdbms.mk)
Data Analysis failure
"Division by zero" error when using nhExportData with LANWAN elements
Database creation failed
Unisys | Statistics rollups are failing
Data analysis failed with new error
DCMA: I am unable to load a saved database from the eHealth console or from the eHealth server's command line.
STATE FARM: BETA AE does not support manually entering the (hard-coded) value for $NH_USER and $NH_HOME
British Telecom: Segmentation fault - core dump error
Poller hangs intermittently
State Farm Insurance Co. | Database backups keep failing when unloading sample data and on different dlg0 tables
Verizon ESG-Frazer | E_SC0345_SERVER_CLOSED - Server not accepting connections because of maintenance or shutdown.
NetSol International Argentina | Segmentation fault on upgrade, 4.8
SPV | DataAnalysis core dumps; errlog reads "out of memory", although there is plenty (.5 GB) left.
R-5.0.2P2 | Bell Canada | Db load from 4.71 into 5.0.2 does not convert
SBC Internet Services | Statistics rollups fail with duplicate keys.
mmcnally | Equant - Sweden | Starting console or server, or performing dbSave, fails with segmentation fault.
R-5.5 | AMASOL AG | AMASOL: After saving the second discovery, the server crashed and was starting to "loop" (crash-restart-crash-restart...)
Unocal | Association failures
Bus errors on nhiIndexDb
Error while converting old custom element types during upgrade 4.8 - 5.0.2.
Statistic rollup and statistic index failures.
Alcatel International | Data analysis log gives warning message on poorly defined Health Report jobs even though the job is disabled in the scheduler | Dropped functionality during code changes
Error in syslog: Job step 'Statistics Rollup' failed
5.0 I18N B1 | BELL CANADA | BELL CANADA: For NhiIndexDiag -U neth -d ehealth, many index errors, heap vs. Btree and wrong key.
Garanti Bankasi A.S. | Assertion for 'cdbSampleLoopCnt++ == 2' failed
Append to stats2 fails
eHealth 5.5 Beta 5 DB fails to load into either an EH 5.5 B5 system or an EH 5.5 RTM system.
NOAA - Dept. of Commerce | Stack dump in errlog.log file and Ingres will not come up
During fetch, server crashes with the following error: Assertion for elemPtr failed. Exiting in file cfgServer.c line 2752
mgenest | Concord Communications | AR configuration not preserved during database save/load. | PRE-EXISTING
Bank of America | nhRemoteFetchDb fails to save any element configuration data to the .zip files in the remotestats.tar file.
comcast (customer sensitivity) | Statistic rollup failure: duplicate keys
First USA | Declined | Undefined nodes in TA reports.
R-4.8.0P9 | Veritas | 08-jun-2002 04:02:13 Error (nhiPoller[Dlg]) Unable to execute 'MODIFY nh_dlg0_1023519599 TO BTREE UNIQUE ON ... Unable to execute
Network Guidance Company | Statistics rollup failing
General Dynamics | Duplicate on insert - nhCleanupNodes
Verizon Internet Services | An orphaned Query Tree object was found and destroyed during QSF
eHealth servers are restarting due to a crash in the fast live poller
ITC DeltaCom | Server crashes with Ingres error.
Table permissions are not being set correctly; data cannot be written to stats0 table
Bank of America, Cegetel | Rollup fails during index of table.
5.5-P15 | Statistics data for element when missed polls ~ 100%
R-5.0.2P3 | Brasil Telecom | Error on nhiCfgServer
Ingres will not start; partition is full
B5-5.6 | IPM Solutions Pty Ltd. | The install script does not allow "multiplexing" for Oracle
R-4.7.1P4 | BT | Unable to locate shared memory
ICS-GMBH ($100k+ in Sep '02) | Error nhiLiveExSvr Pgm nhiLiveExSvr: Sql Error occurred during operation nhQuery6184
Fetch/merge failing on central server netreporter.
mburns | Cegetel | Dupe stats: ORA-00604: error occurred at recursive SQL level 1
During 'new system migration' nhConvertDb fails with errors.
Unable to run Health Reports for time span > 1 week
jkuefler | Telecom Italia | ORA-27102: out of memory error message; Oracle will not start with installed db_block_buffer setting
nhServer stopped unexpectedly.
Toronto Dominion Bank | nhLoadDb fails on .grp files that do not exist.
Errors in console from nhiDbServer and nhiLiveExServer after migration
Jones, Day, Reavis & Pogue | Database rollups fail with duplicate index error message.
nhLoadDb is failing for 5.5
Oracle parameters sort_area_retained_size and sort_area_size, if not set large enough, can cause ORA-01220 error.
A new use case for Oracle save/load is needed.
Terrible report performance on eHealth 5.5 Distributed Reporting.
nhLoadDb fails with error.
Fidelity Investments | Invalid Ingres license from ingstart
DB SplitMerge | AT&T | Split/merge failed during import of stats data.
Comcast | nh_exc_history duplicate error.
NTL | After command-line rollups there are now gaps in the database
N/A | AR agents inserting data into the future and duplicate records.
Pollers freezing on Win2k machines.
R-4.8.0P11 | BTExact | nhServer stops, but never restarts.
R-4.7.1P8 | Sprint | Ingres is corrupted and will not start
Susquehanna (escalated by Sue Fanning) | Rollups failing with duplicate key error.
Getronics Infrastructure | Fatal NI connect error 12560, connecting to:
Etisalat | DbSave not working
Unable to stop or start Oracle
Customer sees iimerge process taking up about 40-50% CPU, and reports take considerably longer to run
ufigwer | 5.5-P2 | Unocal | dbSave not sufficient to restore AR configuration
About 21,000 elements created during server stop without knowing reason
EchoStar | Rollups failing with error message
5.5 Cert D02 | Database conversion during 5.5 D02 upgrade failed. System down.
R-5.0.2P4 | nhiDbServer crashed and cored
County of Sacramento | nhLoadDb failed with error: Database error: ERROR: SQLCODE=-1012 SQLTEXT=ORA-01012: not logged on
Pershing Division | Can't connect to database.
Research in Motion | Ingres crash with LKrelease() failed error
Oracle database would not restart after stopping.
BMW | The problem here is that the checkpoint save takes half an hour, and during this time no statistics data is collected.
Console is slow after applying 5.5 P02
Stats index coredumps or errors out
NTT Communications Corporation SLA | Upgrade from 48 to 502 failed with duplicate error
Health reports fail with the error: Fatal Error: Assertion for 'exText' failed, exiting (Not all exception_element info_id value
ANZ | nhFetchDb is extremely slow on 5.02 P04
Bank of America | Rollups failing after applying fix for 23597
AAA Cooper Transportation | User cannot run AAG report
mwickham | 5.5-P2A | Cegetel | Server is in a restart loop after DB load
Error: "Error: Append to table nh_stats2_1026619199 failed" during stats rollup
Trend report failing with error
B2-5.6 | Beta1 TA customers are broken
Sizing spreadsheet needs 6 rows fixed
Verizon Wireless Bedminster | Need to balance I/O on 5.5 system
SBC Internet Service
[Chart data (GraphData(Qtrs) sheet): quarterly counts, 1998-Q3 through 2004-Q4; series: Created, Ingres (4.x, 5.0), Oracle (5.5, 5.6); release labels 4.x, 5.0, 5.5, 5.6.]
Verizon noticed some v5.5 DB tables consuming more than 1,000 DB extents. We anticipate addressing this issue in Trident. DOES IT CAUSE A PROBLEM? ENHANCEMENT?
Customer wanted info on where to get the Oracle client and ODBC driver.
Customer wanted info on using NFS-mounted databases. We pointed them to the list of Oracle-supported NFS products.
2-machine migration to 5.6.5 seemed to fail. Actually, it produced error messages, but everything worked fine. The error messages were caused by bad stty commands in the customer's .kshrc file. The customer thanked us for finding the .kshrc errors.
CreateDb failed on six servers. Customer had a script on each server that detected long-running "nh" processes and killed them. These were killing off the nhCreateDb attempts.
Unable to start nhServer. Problem was that the license server was not started after a reboot.

Status | ChangeTime | Software Revision | Product Component | Escalated Ticket | Prob
ProbT0000004258 | R-4.0.1e | nhDbStatus does not show correct amount of free space.
ProbT0000004377 | B-4.1.0 | DB Save | System Log | System log should be a file
ProbT0000004422 | R-4.0.1b | DB status gives incorrect free disk space amounts, causing polls not to be inserted in the database | Now use double to report free-space bytes instead of unsigned int
ProbT0000004454 | adavis | R-4.1.0m | Scheduled or GUI D/B saves should have unique identifiers so as to avoid overwriting previous saves.
ProbT0000004500 | egirardi | Rollups fail after 12 hours with force-abort error message due to log file size exceeded; Ingres log file 512 MB (5,100 elements) | Before upgrading to 4.5, the client must move key environment variables (TZ) to nethealthrc.sh.usr and nethealthrc.csh.usr
ProbT0000004594 | All data prior to the Daylight Saving Time change is now rolled up into 1-hour samples. | Corrected cdbUtils, specifically routines revolving around calculations <= 1 hour
ProbT0000004661 | Would like option to turn off the scheduler during nhLoadDb
ProbT0000004727 | The date in "Last Rollup" does not mean the last successful rollup, and it should
ProbT0000004786 | B-4.1.0 R-4.1.0m | Save database by groups.
ProbT0000004903 | Y2K bug in nhRemoteSaveDb
ProbT0000004958 | When a database is saved on NT, then loaded on Solaris in a timezone other than Eastern, the data appears to shift.
ProbT0000005054 | R-4.1.5 | All groups are not populated on database load
ProbT0000005085 | B-4.1.5 | DB Convert | Customer requests that Ingres database conversions be separate from the upgrade to later revisions of Network Health.
ProbT0000005132 | Would like to see a time stamp in the save log.
ProbT0000005137 | R-4.0.0q | Selective backup or restore
ProbT0000005239 | skeene | Statistic rollup per cross-technology profiles.
ProbT0000005380 | Ability to use a relative path instead of a full path with nhSaveDb [-u user] -p path [-backup] [-ascii]
ProbT0000005525 | nhReset changes the permissions on the errlog.log file to READ only.
ProbT0000005575 | esearson | System messages show database is running out of space, when actually approximately 1.4 gig remains.
ProbT0000005584 | R-4.5.01 | N/H B-4.5 D/B fails to load into N/H V4.5 release
ProbT0000005670 | presentation.var and serviceStyles.sde are NOT migrated with db save.
ProbT0000005678 | Data Base Status incorrectly calculates database size.
ProbT0000005685 | Statistics Rollup has been failing consistently with duplicate key index problem. | AE cleaned up system after power failure
ProbT0000005805 | First conversations rollup of each calendar day fails with error.
ProbT0000005834 | tcraig | DB Convert (Merge) | Create a tool that would automate the merging of data for a retired element with its replacement.
ProbT0000005838 | Have nhLoadDb check for available disk space
ProbT0000005950 | Want to see output from a database load as the database is loading
ProbT0000006136
ProbT0000006162 | squintilio | $NH_HOME/bin/nhDbStatus command giving inaccurate results
ProbT0000006172 | Cannot schedule a database save for a database named other than nethealth
ProbT0000006190 | Error: Unable to execute 'CREATE TABLE nh_stats1_940132799 AS SELECT * FROM NH_RLP_STATS' (E_US07DA Duplicate object name 'nh_s
ProbT0000006191 | (E_CO003F COPY: Warning: 144 rows not copied because duplicate key detected.
ProbT0000006212 | manthony | Duplicate rows in stats tables
ProbT0000006251 | Nutcracker seems to be unable to handle Japanese characters on NT.
ProbT0000006329 | Element Migration Utility returns an error during import if duplicate keyed tables exist.
ProbT0000006356 | DB Purge | 64-bit counters put bogus data in the DB; customer needs a tool to remove a couple of days' worth of bad data for 1 element only.
ProbT0000006390 | A wse split/merge to recover one or more specific stats tables.
ProbT0000006394 | Database error: -39100, E_QE0083 Error modifying a table.
ProbT0000006487 | Scheduling several rollup periods based on configured Service Profiles.
ProbT0000006516 | Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
ProbT0000006569 | Conversations Rollup failure.
ProbT0000006574 | Customer wants system log save scheduled automatically with installation of Network Health.
ProbT0000006592 | Error during upgrade or load of database: Cannot convert column 'element_class' to tuple format.
ProbT0000006645 | The nhServer bounces periodically; the only entry in the system log is "Server Started Succesfully".
ProbT0000006717 | DB Indexing | (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
ProbT0000006758 | db indexing | (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
ProbT0000006788 | Cannot access table information due to a non-recoverable DMT_SHOW error.
ProbT0000006794 | E_US1194 Duplicate key on INSERT detected." in the system log, with Statistic Rollups and Statistic Index failing
ProbT0000006824 | Statistics Rollup and Exception Reports are failing due to DMT SHOW errors
ProbT0000006831 | He would like this option so that he doesn't have to worry about the scheduler going off immediately after he loads the db.
ProbT0000006971 | Conversations rollup failure. No error messages in any log except the system messages log.
ProbT0000007002 | "More Info": After upgrade to NH 4.5, getting errors: Append to table nh_dlg0_950669999 failed.
ProbT0000007059 | Statistics rollup fails with error: Rows contain duplicate keys | Closed per Stephanie
ProbT0000007067 | R-4.5.1 P11 D08 | Conversations Rollup keeps failing
ProbT0000007074 | Deadlock errors in errlog.log: E_DM9045_TABLE_DEADLOCK Deadlock encountered locking table neth.nh_stats0_951767999 in data
ProbT0000007085 | Full data integrity check on the database when backup occurs
ProbT0000007096 | Rollups failed due to dups. See esc. tickets directory for logs
ProbT0000007097 | Customer would like the -backup option from a command line dbSave to be added as a check box function to the GUI dbSave.
ProbT0000007109 | tbailey | Rollups failing with Sql error E_US1592: rows contain duplicate keys
ProbT0000007111 | Append to table nh_stats1_950331599 failed, Duplicate Key Detected.
ProbT0000007124 | Statistics_Index and Statistics_Rollup are failing with Sql Errors
ProbT0000007138 | Statistics_Rollups fails with "append to table" and "duplicate key" errors
ProbT0000007143 | Statistics rollup and index fail with duplicate keys.
ProbT0000007157 | Database is inconsistent after system crash
ProbT0000007166 | Database inconsistent after repeated attempts to fix duplicate keys.
ProbT0000007255 | Internal Error (Configuration Server) Expectation for '!_initCbObj && !_initCbRtn' failed after load of 4.1 database on 4.6
ProbT0000007343 | Conversation Rollup fails.
ProbT0000007348 | Statistics Rollups failing with append to table error
ProbT0000007362 | boutotte | B-4.7.0 | CbaBaseApp incorrectly parses args for scheduled jobs.
ProbT0000007366 | Rollups are failing: duplicate table nh_stats1_940111199 failed
ProbT0000007375 | Rollups failing with duplicate keys
ProbT0000007377 | Rollup failed with append to table error
ProbT0000007390 | iidbdb is inconsistent.
ProbT0000007414 | Conversations rollup fails because of duplicate keys
ProbT0000007417 | Would like to see the location of the transaction log in commands and GUI
ProbT0000007437 | DB Sync Protocols | Planned downtime to accept Groups and/or GroupLists
ProbT0000007446 | Rollups are failing: duplicate table NH_stat1_951865199 failed
ProbT0000007468 | Could not create Java virtual machine during Oracle installation. Problem was that the customer was trying to install Oracle from NFS-mounted drives, which Oracle does not support.
Enhancement request asking for us to compress ascii backups.
Customer wanted to know why the maintenance job runs every hour. The answer: to clean up the archive logs.
Customer wants Oracle 9.2.0.4. Per our Oracle support policy, there is no reason for us to move to Oracle 9.2.0.4 now. (We are pursuing getting Oracle permission to allow customers to upgrade themselves.)
Migration attempt from 5.0 to 5.6.5 on Solaris 2.7. Sol 2.7 is not supported and 5.6.5 executables do not run there. Customer ended up performing a two-system migration to Solaris 2.9.
5.0.2 Remote Poller caused Ingres file to exceed 2GB limit. Customer had a group with 80,000 elements in it. Support provided a script to drop some data to avoid the problem.
After upgrade from Nemo Beta 1 to Beta 2, report tab entries were grayed out. Customer had not followed the migration process; loaded a 5.0 DB onto a Nemo system. (Not a DB issue?)
Customer had Aview error; mis-applied a one-off that overwrote system scripts in /etc/init.d. We reverted their scripts back to the v5.0 rev. (Not a DB issue.)
/etc/init.d/httpd.sh script had wrong values in it. (Not a DB issue?)
Customer disabled cleanup job, so cluster file transfer table grew huge
Disk drive failure
Systems missing critical OS patches
Support ran wrong version of nhGetCustDb
Customer set year to 2012
Customer did not have Ingres running during backfill in 2MM
Sizing spreadsheet issue addressed with sizing wizard
Request for Oracle security patches that are already in 5.6.5 P2
ProServ copied file between systems without updating the file
Enhancement request
Request for info on RAID devices
Extra large createDb wasn't hung, just slow; it completed
Customer ran for two days with year set to 2012
BUG NOT DB | Convert failed due to duplicate. Problem was profiles in the Live Ex config that have the same name as profiles loaded for patch and cert releases. Customer had probably exported and imported between two systems not at the same rev level.
NON-DB CODE | All reports fail after upgrade from 5.5 to 5.6.1. Problem was invalid .usr files. DB issue?
5.5 to 5.6.5 upgrade failed. 52 corrupted AR paths in the poller config file. CORRECT?
Oracle error from SysEdge in Nemo Beta. Info from this PT ended up going into PT 37422. Problem does not occur if Oracle AIM is not loaded. CCRD-South pursuing.
DA failure, thought to be related to Split Merge. Ticket marked repeat of 37343, tracking numerous epidemic DA failures caused by indexes not created because of dups. Stats forensics run, data gathered, and transitioned to poller team.
Wrong NH_USER value in file
Report failure on Japanese 5.6 system. Customer upgraded to 5.6.5 to resolve issue.
OLD | Report failure in 5.0.2. Customer migrated to 5.6.5 to address.
Report failure in 5.6.5. Problem solved by applying P2.
OLD ISSUE | Report failure in 5.6.5 P1. Ran fine with standard profiles. In MoreInfo.
AR profile rules for LE are dropped on upgrade to 5.6.5. Fixed in 5.6.5 P2.
COMPLEX ACTIVITY | Data gaps after load from one system to another, on eH 5.5. Need more info on cause
PROCESS | DROPPED INDEXES
ProbT0000008432 | stats roll-up
ProbT0000008467 | Statistics Rollup failing due to duplicate keys
ProbT0000008532 | Rollups failing: Append to table nh_stats1_953873999 failed
ProbT0000008533 | Server down; rollups failing on nh_stats1_953791199; customer hit the wall at 96% usage
ProbT0000008653
ProbT0000008654 | Rollups failing due to duplicate keys
ProbT0000008675 | Stats rollups failing
ProbT0000008711 | Rollups failing on unknown table. Customer to send rollup log to Jose.
ProbT0000008800 | DB Sync Map | Would like nhExportConfig to work with -subjName and -elemType arguments
ProbT0000008850 | table nh_stats1_958017599 failed
ProbT0000008874 | Table could not be indexed because rows contain duplicate keys
ProbT0000008936 | Statistics Rollups failing due to duplicate keys
ProbT0000008967 | Conversation Rollups failing with segmentation faults and core dumps
ProbT0000009031 | Rollup failure leads to loss of disk space
ProbT0000009039 | Statistics Rollup job failing with Error: Append to table nh_stats1_951861599 failed
ProbT0000009067 | B-4.5.0 | Stats Rollup failed, SQL error, duplicate keys
ProbT0000009114 | Advisory bug only: nhiIndexDiag outputs erroneous error message.
ProbT0000009118 | Statistics Rollups failing due to append to table error
ProbT0000009119 | Stats Rollups and Stats Index failing with Sql error 1592: rows contain duplicate keys.
ProbT0000009146 | E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_dlg1b_954565199, Page 5439
ProbT0000009171
ProbT0000009224 | DB-Roll up | Roll-ups failing due to append to table
ProbT0000009225 | db rollup failure | Roll-ups fail due to append to table.
ProbT0000009227 | Statistics Rollup failure: Append to table nh_stats1_955684799 failed
ProbT0000009250 | "Unable to execute" error doing a database save.
ProbT0000009263 | Roll-up failures with Deadlock detected
ProbT0000009271 | Conversations roll-ups are failing.
ProbT0000009306 | DB start-up | Using the nhStartDb command he gets a duplicate record found error.
ProbT0000009328 | dbfetch | Duplicates in database.
ProbT0000009369
ProbT0000009380 | Rollups failing.
ProbT0000009391 | Conversation Rollup failure caused by possibly corrupt index in DB
ProbT0000009392 | Error: Append to table nh_dlg1b_959745599 failed
ProbT0000009399 | Rollups failed
ProbT0000009405 | Sql Error occured during operation E_US1592 INDEX
ProbT0000009509 | poller.cfg not converted on upgrade from 4.1 to 4.6
ProbT0000009626 | Rollups fail due to SQL error occurring during operation
ProbT0000009636 | Statistics rollup failure: Append to table nh_stats1_960328799 failed
ProbT0000009695 | Duplicate keys after running cleanStats script
ProbT0000009715 | Conversation rollups failing after Patch 13 was installed
ProbT0000009722 | Getting sql error emailed to her after a database fetch.
ProbT0000009726 | Statistics Rollup failure. Append to table stats_1
ProbT0000009741 | Statistics Rollup failure. Recurring even after installation of patch 3
ProbT0000009840 | Statistics Rollup hangs.
ProbT0000009878 | Conversational roll-ups failing.
ProbT0000009911 | R-4.6 P4 | Rollups failing with "Schedule poll was missed next poll will occur" error in the system log.
ProbT0000009917 | iiattribute table is corrupt
ProbT0000009944 | nhMvCkpLocation does not source nethealthrc.sh
ProbT0000009966 | Conversations Roll up has duplicate keys
ProbT0000009976 | Database status GUI shows all zeros intermittently.
ProbT0000009977 | Error occurred checkpointing the database. Running NH v4.6/patch 3.
ProbT0000010012 | R-4.6 P1 | Segmentation fault during database conversion for upgrading from 4.1.5 to 4.6
ProbT0000010044 | Rollups are failing due to duplicates
ProbT0000010063 | Statistics Rollup failure (post patch 3)
ProbT0000010122 | Scheduled jobs were lost after save and load
ProbT0000010142 | Command line nhSaveDb will delete all files in drive due to small syntax error
ProbT0000010182 | dkrauss | Conversations rollup failing with 'Append to table nh_dlg1b_...' error
ProbT0000010290 | Frequent rollup problems
ProbT0000010301 | DMT_SHOW error when doing database save
ProbT0000010341 | R-4.5.0P11 | Stack dump errors in errlog.log
ProbT0000010350 | New executable for nhiDialogRollup failed
ProbT0000010371 | Change scheduled default rollups when setting up remote site
ProbT0000010399
ProbT0000010410
ProbT0000010414 | R-4.5.0P14 | nhSaveDb failing: BUS ERROR, generating a core file.
ProbT0000010419 | Statistic Index failing; cleanStats script not fixing problem
ProbT0000010546 | nhDiagMonitor should also monitor Ingres processes to make sure the database is available
ProbT0000010603
ProbT0000010768 | nhSaveDb stops when it finds a table with DMT SHOW error
ProbT0000011096
ProbT0000011248
ProbT0000011338 | Command line utility that runs a check on the NH Db, tables, indices, etc., and reports on any inconsistencies found.
ProbT0000011339 | Command line utility that echoes the output being written during an nhLoadDb to StdOut.
ProbT0000011367
ProbT0000011394
ProbT0000011436 | Ability to delete users from database
ProbT0000011494 | Upgrade appears to fail during database conversion.
ProbT0000011615 | Statistics Rollup failure due to append to stats1
ProbT0000011627
ProbT0000011686
ProbT0000011731 | xzhang | B2-4.8.0 | element_type 351 is not in the nh_element_type table
ProbT0000011742
ProbT0000011753 | Received error during 4.7.1 P01 upgrade of Nethealth
ProbT0000011812 | Customer would like to be able to have multiple checkpoint locations so that daily saves won't overwrite
ProbT0000011973 | Sysmod of database 'nethealth' abnormally terminated.
ProbT0000012059
ProbT0000012071 | Stats rollup failure: Append to stats1 table failed.
ProbT0000012093
ProbT0000012095
ProbT0000012128 | nhDbStatus output is directed to error stream
ProbT0000012172
ProbT0000012198 | Conversation Rollup fails due to duplicated rows in dlg tables
ProbT0000012255 | Customer needs changes to the rollup schedule logged in a log file for change history.
ProbT0000012271
ProbT0000012313 | Customer has added custom applications to Traffic Accounting reporting and would like to remove them from the database.
ProbT0000012355 | Conversation rollup failing with append to dlg1b table error
ProbT0000012382 | 'expression syntax error' received in fetch log
ProbT0000012410 | Expect to see an error written to the save.log when running nhSaveDb specifying an invalid path
ProbT0000012529
ProbT0000012615 | nhiDialogRollup consistently fails.
ProbT0000012685
ProbT0000012692
ProbT0000012696 | Error: Append to table nh_dlg1s_979707599 failed
ProbT0000012743 | dmcauliffe | If you create/modify a scheduled Db Save and enter leading space(s) for the dir name, no .tdb dir is created
ProbT0000012771 | A non-ascii cross-platform db fetch will corrupt central's time stamps
ProbT0000012794 | Dr. Watson errors: nhiDialogRollup.exe Exception: stack overflow (0x00000fd), Address: 0x704f03a7
ProbT0000012803 | Would like an error message when trying to relocate the checkpoint save location
ProbT0000012815 | Conversation Rollup fails with segmentation fault
ProbT0000012826 | R-4.5.0P13 | Conversation Rollups failing with Append to table nh_dlg1b_981089999 failed
ProbT0000012829 | dblodgett | Ingres stack dumps are causing Nethealth servers to stop
ProbT0000012850
ProbT0000012856 | Customer would like any errors in a command line save to alert at the command line
ProbT0000012866 | 4.7.2 Cert D04 | Change the way Statistics Rollups are accomplished.
ProbT0000013039 | jdodge | Customer would like to be able to specify the number of days a checkpoint save is kept.
ProbT0000013102
ProbT0000013140
ProbT0000013198
ProbT0000013223 | nhiDialogRollup receives a segmentation fault and produces a core dump.
ProbT0000013224
ProbT0000013240 | Checksum errors in database
ProbT0000013357 | Rollups fail silently; running out of space
ProbT0000013501
ProbT0000013562 | Conversation Rollup fails with SQL error.
ProbT0000013775 | Append to table nh_stats1_981953999 failed
ProbT0000013843
ProbT0000014093
ProbT0000014123
ProbT0000014221 | Request for GUI meter to be included for the database conversion during upgrade.
ProbT0000014229 | 4.7.1 Cert D03 | Time was changed on server to 08/05/2001, then set back. Now getting bad polls.
ProbT0000014260 | Ingres does not start automatically after reboot; must be started manually
ProbT0000014519 | R-4.7.2 P2 | Silent rollup failure as DB grows larger.
ProbT0000014537
ProbT0000014865 | Data Analysis is failing
ProbT0000015164
ProbT0000015211
ProbT0000015223 | nhDeleteElements not working.
ALCATEL USA: The beta2 installation failed. Oracle install went clean, but apparently there is not enough room on the /apps disk
CSC | Repeat | CSC: BETA AE: "%NH_HOME%/tmp" not created in install while creating DB
CSC: nhiDbStatus.EXE: Internal Error
CONCORD MIS | CCRD MIS: not allowing the creation of a database when specifying 2 directories
PWC: Scheduled DBMaintenance failure
B3-5.5 | BRITISH TELECOM | BT Adastral Park: B3 install didn't create DB. Error: You must run query and setup mode before running doit.
EQUANT: BETA AE: You have to cancel the install, then restart and choose not to create DB, to continue the install
DB Oracle Migration Utility
ALCATEL USA: Jeff Beck: nhSchedule randomly hangs
ALCATEL USA: Jeff Beck: Usage syntax for nhCreateDb is wrong
UNISYS | UNISYS: Database error: (ORA-01552: cannot use system rollback segment for non-system tablespac).
Unable to save database due to table error
UNISYS: stuck Ingres DB - DEVKMEN
SEAGATE | SEAGATE (AE: Joseph Madi): Migration document does not indicate if eHealth should be shut down. Please advise.
CSC: CCRD Keville: If you run the command nhDestroyDb with the argument ehealth (the old syntax), the $NH_HOME dir is removed.
SEAGATE: FLEXLM License service terminated unexpectedly during migration.
UNISYS: the step that reconfigures the web server using nhiHttpdCfg causes a memory fault core dump.
Memory errors during Conversations rollup
ALCATEL USA: The nhiDbStatus processes consume at or close to 100% CPU.
UNISYS: Migration running for 48 hours
PWC: Conversations rollups fail.
CONCORD SUPPORT | CCRD SUPPORT: pies went from orange to red; server stopped unexpectedly. Multiple error messages
Logical NZ | iimerge taking up 50% CPU when executing nhSaveDb, nhCollectCustData, etc.
SEAGATE: nhiPollSave took 4 days to complete.
R-5.0.2P1 | Database ascii save fails; Ingres dies
FactSet | Ingres unable to allocate more locks/lock lists. | Postponed
Ascii database save fails
MCIWCOM | nhReset fails to bring processes down cleanly due to an Ingres error.
Telstra Corp. | Data on remotes is missing from central
DG Bank | nh_subject, nh_group and nh_group_list contain dups for name "all"
R-4.7.1P5 | BMW-AG | Duplicate keys on fetch
CenterBeam | Unable to access iidbdb database
EARTHLINK | EARTHLINK: On the front-end non-trusted host, if you execute the nhDbStatus command it runs away and consumes almost all the CPU.
5.0.1 Cert D01 | Telstra | Scheduled reports fail
cgould | BT Exact | When creating a custom gauge variable, an inverse function in the columnExpression.usr file yields incorrect results.
B4-5.5 | ALCATEL CANADA | ALCATEL CAN: nhCreateDb failed
STATE FARM | State Farm (Beta AE): ran out of disk space
State Farm (Beta AE): conversion of Ingres to Oracle failed
EQUANT: Getting error during nhDbCreate
UNISYS: Unsure of status or errors during reload of polled data. Error follows:
UNISYS: While running the Convert MyHealth data step during migration, the run produced multiple core dumps; error message follows:
UNISYS: Problems when attempting to shut down Ingres. Error follows:
CCRD MIS: Peter Skotny: problem when running "nhLoadConfigInfo.sh"; bad records in the db cause nhLoadConfigInfo to fail
UNISYS: While checking the eHealth logs after performing the 5.5 migration, noticed errors in the Data Analysis log file:
ALCATEL CAN: nhLoadConfigInfo failed.
CCRD MIS: Peter Skotny/ProServ: Problem starting the Atlanta Poller. NEED HELP!
British Telecom: a lot of sql errors
British Telecom: nhSaveConfigInfo.sh waits a long time before telling you that the poller initialisation won't complete
COMPUCOM | CompuCom (Jeff Beck): Error with nhsaveconfiginfo
CompuCom (Jeff Beck): Error with nhcreatedb
UNISYS: Maximum number of cursors reached during migration
EQUANT: oracle is not running
ascorupsky | $25K for IBM Global Services | nhConvertDb command fails | didn't consider all possible cases the customer could be in
ALCATEL CAN: nhLoadPolledData.sh reports errors during migration
ALCATEL CAN: Error: Database error: ERROR: SQLCODE=-258 SQLTEXT=ORA-00258:
DCMA | DCMA: DB load moved from sysTosys; segmentation, connect errors
HFC Bank | Errors during Data Analysis on nh_hourly_health
AMASOL | AMASOL: feature seems not to be supported with beta4 (got an error msg about a non-existing oracle path within the ingres backup
ALCATEL CAN: TECH TIP: When trying to load the db I received this error: $ nhLoadDb -p /database/beta/ehealth55.tdb
STATE FARM: BETA AE: At the end of the nhLoadPolledData.sh script, the INDEX.DDL did not load/execute properly
Stats. Rollup fails with sql error.
CCRD MIS | CCRD MIS: Oracle ran out of space. Did EH do the right thing?
CCRD MIS: TECH TIP: DB did not recover nicely after Oracle ran out of space.
CCRD MIS: stats table does not exist after recovery from db running out of space
apier | nhiCfgServer crashes during CWM import
dwaterson | T-Systems, Bielefeld: Inconsistent database caused by iirelation table deadlocks
mpoller | Synovus Data Corp. | Scheduled jobs are missing all variables
ALCATEL USA: nhConvertDb error occurred during installation
STATE FARM: BETA AE: "Internal Error nhiPoller[Net] Pgm nhiPoller[Net]: Expectation for `Bad` failed (Element not found in Db
PWC | PWC: Scheduled Clean Nodes failure
ACXIOM | ACXIOM: Database corrupted
R-4.8.0P7 | IT-Austria | UNISYS: Clean install with Beta 5, new migration. When creating Oracle DB, got error: Not enough usable space was specified.
ALCATEL CAN: Console initialization failed after successful 5.0.2 to 5.5 migration
EARTHLINK: BETA AE: looks as though nhConvertDb was not run during the upgrade
R-4.8.0P8 | SBC IDC | Live Exceptions "core" file created when eHealth crashed
Bank of America $300K | Fetch takes too long due to resource contention and bad table writes
New York Stock Exchange, Inc./SIAC | nhLoadBackfillData is getting error: "Failed getting schema version time"
At A Glance report fails because of ULH errors in ingres
Eurotel Bratislava as | AAG got ORA-00001 + oraRemoveStatsDups: getting an ORA-00054
Getting some bad data/huge values/negative values in some stats0/stats1 tables
Concord-MIS OPS | nhUpgradeOracleDb fails because it is attempting to shrink system tablespace
Clarification of LCF for customer pre-migrate
DDGSOA Dimension Data Managed Services Operations | Split merge caused corruption of database tables
Atos Origin | MOSNXB08 | WIP | nhConvertDb: *.rpt not migrated properly from 5.02 to 5.6.5
invalid object nh_cust_lock
R-5.0.2P9 | nh_daily_symbol_100007 table has reached the 2Gb limit
smoran | Kluwer Academic Publishers | Error in Health report after migration to 5.6.5
ehealth 565 install failed with oracle error
DDGSOA Dimension Data Managed Services | "ostat" files in $NH_HOME/tmp
Unisys Australia | When doing an nhConvertDb during a load, getting an error about table BK_VBZ$SCRIPTS
5.6.6-B02 | FBOP Corporation | nhLoadDb failed with BROKEN SID error
Continental Casualty Company (CNA Insurance) | 1MM failing at oracle patch install
Excessive Extents | R-5.6.5 | R5.6.5P1 | DDGSOA Datacraft Asia | AAG reports taking 45 mins to 1 hour+ for certain system elements
Possible Oracle bottleneck affecting ad hoc AAG reports in cluster
error=27072 txt: 'HP-UX Error: 27: File too large
nhGetCustDb script fails
Lexis-Nexis | Since upgrading from 4.8 to 5.6.5 P1, Health and TopN reports fail with a stack dump or take over 6 hours to complete.
State of Mississippi ITS | The console shows the delete archive log job failed
Install of database hangs during migration
HCA Healthcare | Capacity Health report run on large groups fails with SQL error referencing invalid cursor
Oracle 5.5 install fails via a disappearing install window
ORA-01115: when running nhPopulateDac
R5.6.5P2 | Telekom Austria AG | Install of P02 fails on nhiConvertDb
State Farm | nhManageDbSpace -createLcf is mucking with control files!
HSBC | Why aren't Oracle RMAN database saves automatically compressed?
After upgrading from 5.0.2 to 5.6.5, Stats Index Job is disabled
Statens Forvaltningstjeneste | Error during ascii load: Error: The program nhiLoadDb failed during materialized view creation failed.
Vodafone | Security issues with Oracle.
Telcordia Technologies | MS04-011 patch may crash ingres startup
R-5.0.2P10 | Getting [?,cdb ] :: MSD_INDEX1A_FAILURE
Comcast IP Services | Upgrade failure at Comcast
nhConvertDb failed during upgrading patch from 565 P0 to 565 P2
Provinzial | Database save failing due to lack of disk space, although a successful save does not require that much space.
Upgrade from 55 to 565 failed with permission error
Element deletion on central machine fails with UNIQUE index error
Cleanup of db required; ref CT 95546.
Carolina's Healthcare System | Changing OS platforms from HPUX to Windows RAID config.
Database error during D02 install
ProbT0000022182 | Groups are not deleted on the central machine when they are deleted on the remote machine.
ProbT0000022201
ProbT0000022206
ProbT0000022279
ProbT0000022284 | BCEE - De-escalated: Db is converted, but they are not polling these 4 custom variables that they created for their Citrix servers (and they have a lot of Citrix servers).
ProbT0000022359 | Unisys: revenue impact per Paul Ratajczyk
ProbT0000022361 | E_PS0F02_MEMORY_FULL There is no more available memory.
ProbT0000022409 | Reached 2 gig limit on nh_element table.
ProbT0000022412
ProbT0000022419 | pdegroot | 5.6M2 | lex engine gives SQL error on LE baselines
ProbT0000022420 | Bell Canada: escalated by Empowered Networks' Stephen Galbraith
ProbT0000022427 | nhServer stops unexpectedly
ProbT0000022463
ProbT0000022499
ProbT0000022568
ProbT0000022588 | DCMA: 130K on the line if this isn't fixed by Monday... also they will 'de-commission' eHealth from the site.
ProbT0000022699 | Scheduler: DB server closes after scheduled job is modified in the GUI
ProbT0000022719 | Sql Error occured during operation (E_USODAE SELECT on table nh_daily_health_1000006: no GRANT or GRANT compatible permit exists
ProbT0000022742
ProbT0000022748 | Network Insights: customer is losing data. We provided a one-off fix that is unusable due to conflicting library files in another one-off for the same customer. Need to escalate so that eng. can devote the necessary time to resolve.
ProbT0000022755 | Statistics Index Job continues to fail with Duplicate Key error
ProbT0000022783
ProbT0000022845
ProbT0000022907
ProbT0000022938
ProbT0000022962
ProbT0000022973
ProbT0000022979
ProbT0000022997
ProbT0000022998
ProbT0000023122 | Need recovery procedure for when the nh_element datafile reaches the 2GB limit
ProbT0000023219 | Unnecessary output produced from nhIndexDiag
ProbT0000023256
ProbT0000023301 | Cannot load a Db that was saved on a different Oracle server.
ProbT0000023419 | A progress indicator needs to be added to nhSaveConfigInfo.sh.
ProbT0000023454 | General Dynamics: nothing being written to the db
ProbT0000023486 | nhLoadPolledData.sh fails with error; however, indexing continues
ProbT0000023571 | dbload with problem
ProbT0000023597
ProbT0000023659 | Reports are running slow
ProbT0000023676 | Equant: opened as critical since 7/23; now customer is pushing this much harder.
ProbT0000023702
ProbT0000023718
ProbT0000023819 | EHEALTH Oracle instance does not start even though the EHEALTH Oracle Service shows as started.
ProbT0000023839
ProbT0000024007
ProbT0000024146 | nhSaveConfigInfo.sh failed
ProbT0000024214 | lanWan discovery (merge) results in this error: Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C
ProbT0000024241 | Explanation needed for this message in the errlog.log file: "Association failure: partner abruptly released association"
ProbT0000024274 | DataAnalysis failing with an error causing a report to run for 23 hours.
ProbT0000024347 | Custom variables that worked on 5.0 are now causing nhConvertDb to fail on 5.5.
ProbT0000024416 | Ingres database stops intermittently; problem with iiattribute table
ProbT0000024455 | Unable to execute 'COPY TABLE nh_daily_exceptions_1000001
ProbT0000024458
ProbT0000024524
ProbT0000024551 | Dialog rollup causing core dump.
ProbT0000024560 | Table cannot be indexed because rows contain duplicates
ProbT0000024616
ProbT0000024625
ProbT0000024728 | nhLoadDb fails with error: "failed to create file /busa2/eh.tdb/oracle_export/imp.log"
ProbT0000024770 | Rockefeller Group Telecommunications: escalated per Sue Fanning in the 9/9 Eastern conference call. We are looking at having them as a reference account, and this will greatly impact that.
ProbT0000024873 | Ingres lock quota exceeded; reports fail
ProbT0000024988 | nhSaveConfigInfo.sh fails on 5.0.2 -> 5.5 upgrade
ProbT0000025105 | nhiDbServer cores on Group Config via UI.
ProbT0000025160
ProbT0000025278 | Stack dump errors and deadlocks in the database, possibly causing the scheduler not to run
ProbT0000025316 | ANZ: has been experiencing these poller resets for the past 2 years on 4.7.1 and now on 5.0.2. We had associated a bug which was fixed in P03 on a prior ticket, but the recent reset is now their big concern.
ProbT0000025354 | nhLoadDb fails
ProbT0000025357
ProbT0000025377 | AT&T: per Pat Kelly, potential of 50,000 routers. This is the ITS business at ATT we lost and are now winning back. Potential of $1 million
ProbT0000025420
ProbT0000025470
ProbT0000025547
ProbT0000025549
ProbT0000025553 | nhFetchDb fails with certain remote pollers.
ProbT0000025564
ProbT0000025580 | Fidelity & Getronics: down because unable to load the DB
ProbT0000025630 | Replacement of systemTypeDefs.omx by oracleMigration.sh prevents standalone poller from starting
ProbT0000025709
ProbT0000025742 | Deinstall of eHealth 5.5 P01 or P01a fails at the nhConvertDb step
ProbT0000025786
ProbT0000025796
ProbT0000025804 | Customer has dubious partition/hard disk configuration causing problems with nhCreateDb
ProbT0000025889
ProbT0000025891 | Console and poller freeze up once a week.
ProbT0000026112
ProbT0000026153 | Elements show up in the poller configuration window that can't be modified and can't be deleted.
ProbT0000026189
ProbT0000026241
ProbT0000026304 | CERT Advisory CA-2001-16: Oracle 8i contains buffer overflow in TNS listener
ProbT0000026349
ProbT0000026445
ProbT0000026473 | Backend Poller changed to Distributed Console after reboot
ProbT0000026534
ProbT0000026655
ProbT0000026676
ProbT0000026684 | Capital One: escalated per Sue Fanning; this is holding up revenue for this quarter
ProbT0000026687 | rpattabhi | Need prereq checker for env variables | DEV_TASK
ProbT0000026758 | Rollups failing with error: "Error: Table 'nh_stats2_1032623999' was selected using a min time less than its defined min boundary."
ProbT0000026865
ProbT0000026903 | Sql error occurred during operation (ORA-01658: unable to create INITIAL extent for segment in tablespace N) after installing P2
ProbT0000027047
ProbT0000027052 | IImerge locked at 100%
ProbT0000027053 | Ingres crashes frequently
ProbT0000027056
ProbT0000027111
ProbT0000027141 | Customer wants explanation for deadlocks
ProbT0000027256
ProbT0000027281 | Database save is failing with error: Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_symbol ()
ProbT0000027375
ProbT0000027394
ProbT0000027398
ProbT0000027455 | The server has shut itself down about 5 times in the past week
ProbT0000027456 | Fetches are hanging at the nhiInsertElemTbl
ProbT0000027522 | ORA-01152: file 1 was not restored from a sufficiently old backup
ProbT0000027542 | The eHealth server stopped after a discovery and now will no longer start.
ProbT0000027579 | Number of elements in the poller config UI does not match what is in the dbStatus output or the poller.cfg file.
ProbT0000027606
ProbT0000027608 | R-4.8.0P10
ProbT0000027621
ProbT0000027623
ProbT0000027644
ProbT0000027717
ProbT0000027753 | Getronics Infrastructure Solutions BV: customer is unable to run any health reports
ProbT0000027778
ProbT0000027826
ProbT0000027877 | tctang | oracle in 5.5 produces error: insert values that are too large
ProbT0000027950
ProbT0000027953
ProbT0000027984 | nhiCalcBaseline failed with Error: Unable to open 'dbuBslnInfo'. (dbu/DbuCalcBaseline::writebslnInfo)
ProbT0000027986
ProbT0000028008
ProbT0000028048 | Error occured during operation: 'ORA-00001: unique constraint (NETADMIN.NH_STATS0_1040381999_IX1) viola'; possible problem data
ProbT0000028060 | nhRemoteSaveDb with the "-g" and "-gl" arguments breaks Central site Groups
ProbT0000028064 | E_QE0080 Error trying to position a table.
ProbT0000028099 | NH_INDEX datafiles set to auto-extend; datafiles filling up.
ProbT0000028108 | Ingres does not restart properly on reboot
ProbT0000028190 | Conversation rollups failing with error. | SIDE_EFFECT
ProbT0000028195 | Ingres crashes intermittently
ProbT0000028219 | Added datafiles do not show up in Database Status window
ProbT0000028271
ProbT0000028321
ProbT0000028356
ProbT0000028418 | Unable to save database due to failure to retrieve shared memory.
ProbT0000028461 | nhiStatsIndex fails with the error.
ProbT0000028462
ProbT0000028470
ProbT0000028542
ProbT0000028550
ProbT0000028604 | Remote saves run but do not save any stats tables.
ProbT0000028663
ProbT0000028671
ProbT0000028684 | Duplicate entries in stats0 tables
ProbT0000028718
ProbT0000028792
ProbT0000028824
ProbT0000028842
ProbT0000028844
ProbT0000028864
ProbT0000028912
ProbT0000028937
ProbT0000029031 | nhConvertDb failed
ProbT0000029077 | Scheduled database save to mapped drive fails with "fork for 'nhSaveDb' failed" error.
ProbT0000029090
ProbT0000029122 | Data is missing in the trend report for a few elements for Sept/2002
ProbT0000029271
ProbT0000029370
ProbT0000029396
ProbT0000029397
ProbT0000029401
ProbT0000029460
ProbT0000029476
ProbT0000029484 | Request to increase redo log size and log_buffer init parameter
ProbT0000029586
ProbT0000029589 | Fetch merge failing; console crashes with core file.
ProbT0000029609 | nhiIndexStats on 502 patch 5 fails if one of the stats0 tables contains duplicates
ProbT0000029618
ProbT0000029730 | Oracle rollup failures cause permanent disk utilization problems. ProServ is unable to recover database. Customer down.
Song Networks | Unable to start server on eHealth 5.5 during migration from 4.8
EMCOR | Database save fails with segmentation fault.
cbjork | verizon wireless | Database saves failing with error: "ORA-01455: converting column overflows integer datatype"
SBC | Cannot add elements into an existing group
Verizon Bedminster | Fragmentation of tablespace using up disk space.
Database performance issue
ONO - Spain | Delete Database Archive Logs job failing with bus errors, along with Database Save failing
R-5.0.2P5 | Index is not completing after fetch
Jones, Day, Reavis & Pogue | Database load failed
eHealth will not start
5.5-P3 | nhiCfgServer creating core file.
When scheduled jobs hang, the Delete Database Archive Log job does not run
Sprint LDD (Long Distance Division) | nhiStdReport: segmentation violation, exiting.
SONG NETWORKS | No historical data visible following 4.8 -> 5.5 migration
How to move a saved database archive to another machine on the network.
Verizon wireless Bedminster | Stats Rollups failing with "min time less than its defined min boundary" error
Blocking locks causing slow report runs
Redo log wait is too high, which causes a lot of I/O wait
eHealth Oracle tuning for systems with > 4GB memory
drecchion | Verizon Wireless Bedminster | The product has no way of identifying objects that need reorganization
dci file of 50,000 elements core dumps with one-off 29092 installed.
AT&T Labs / Government | Fetch failed on merging elements from remote to central
adisciullo | 5.5-P3A | x console slow; reopen of PT 27623
Oracle database is hung after trying to move objects back to tablespace
Duplicate elements on the remote poller
Duplicate key on INSERT: Pgm nhiMsgServer (E_US1194 Duplicate key on INSERT detected.
Telemar | Groups created on Remote Poller not coming over.
Bank of America: escalated per Sue Fanning
Statistics rollup failures with associated Ingres error messages.
Database load will not initiate
Poor console performance
nhDeleteElements does not work properly
Database save fails with DMT show error on one of the stats0 tables
R-5.0.2P5A | First National Bank | Fatal database error received during patch application
Verizon | Lots of extents cause poor performance.
Stats index failing due to duplicate stats entries
Rollup failing with Sql Error occured during operation: (ORA-01455: converting column overflows integer datatype
ONO | Data analysis has error messages.
VZW | Patch: Oracle security patch issue: dbsnmp agent patch not installed on eHealth servers running 8.1.7.4
Qualcomm Incorporated
ProbT0000015241 | Table iirelation (owner $ingres) has a mismatch in number of columns.
ProbT0000015386 | Database save fails at Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats0_991166399 () INTO '/opt/health/save.tdb/nh
ProbT0000015402
ProbT0000015475
ProbT0000015488 | Statistics Rollup failure due to non-recoverable DMT_SHOW error.
ProbT0000015541 | Database is frequently inconsistent
ProbT0000015673
ProbT0000015689 | Missing two polling cycles because of XLIB error in the syslog.log.
ProbT0000015709 | Chase Bank: $400K which they need to get into Q3
ProbT0000015757
ProbT0000016044 | DbSave fails randomly: nhiSaveDb.exe: Fatal Internal Error: Unable to execute 'COPY
ProbT0000016098 | Dbsave after force consistent.
ProbT0000016104 | nhDbStatus does not accept redirection of output
ProbT0000016200 | Ingres server stops after stack dump and bus errors
ProbT0000016261
ProbT0000016372 | Output of nhIndexDiag shows tables as HEAP when they should have been BTREE
ProbT0000016460
ProbT0000016590
ProbT0000016950
ProbT0000017078 | nhiRollupDb Begin processing (08/13/2001 09:43:58 AM). Error: Sql Error occured during operation (E_US1592 INDEX: table could
ProbT0000017637
ProbT0000017681 | nhSaveDb is failing with 'segmentation fault' errors.
ProbT0000017716 | Maintenance job hanging
ProbT0000017743
ProbT0000017799
ProbT0000017816
ProbT0000017821 | E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Mon Sep 10 05:08:27
ProbT0000017828 | Data Analysis is failing nightly after upgrade to 4.8
ProbT0000017941 | Ralston Purina: customer can't save db; this is a down situation, and there is also a revenue impact to Q3
ProbT0000018012
ProbT0000018031 | A2-5.5 | Creation of nethealth database hangs nethealth install at SQL> prompt.
ProbT0000018085 | Cannot execute nhSaveDb without error
ProbT0000018090 | Database stops intermittently.
ProbT0000018104
ProbT0000018152 | Ingres Event Log errors
ProbT0000018160 | The install failed; had a problem creating the redo.log file in the /export/platinum1/bd54.5/oradata/NHTD directory.
ProbT0000018165 | No DBMS Servers during nhDestroyDb.
ProbT0000018205 | Customer is getting the error: E_CO003F COPY: Warning: 4 rows not copied because duplicate key detected.
in their fetch logs.ProbT0000018218nalaridK'Server stopped unexpectedly, restarting' error upon console initializationProbT00000182734Ingres license failure, unable to start the databaseProbT0000018286After rebooting the system the NHTD instance does not start automaticaly after reboot, causing nethealth server to fail at startProbT00000183267Chase Bank - had 16 hours of no data in the past 2 daysProbT0000018443 R-4.7.1P6See DDProbT0000018455nhiCfgServer crashesProbT0000018476ProbT0000018609Consistency Check ErrorProbT0000018611Problem: Error in save.logProbT0000018631nhSaveDb failing.ProbT0000018658 R-4.7.2 P3Statistic Rollups keep failingProbT0000018917Several large distributed polling customers are using nhReset to free up table space that is consumed by multiple fetches per day.ProbT0000018930"nhDbStatus causes database failureProbT0000018934(nhiDialogRollups appear to fail silentlyProbT0000018949ProbT0000018957ProbT0000019081/Statistics rollup failing due to duplicate keysProbT00000191319iidbms using 104% of cpu on dual cpu HP UX ehealth ServerProbT0000019149ProbT0000019163ProbT00000192388Stats0 tables unreferenced in the nh_rlp_boundary table.ProbT0000019241*Fetch failing with duplicate element_id's:ProbT0000019243*RemoteSaves with -g and -gl options fails:ProbT0000019293'Stack dmp name error in errlog.log fileProbT0000019303ProbT0000019307%Conversation Rollups failing silentlyProbT00000193213Scheduled dataAnalysis log filling up with error. 
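Many of the tickets above share one signature: rollups and indexing fail with E_US1592 because stats0 rows contain duplicate keys. As a rough illustration of the kind of duplicate-key check a cleanup script has to perform, here is a minimal sketch. The table and column names (nh_stats0, element_id, sample_time) follow the naming visible in the log, but the schema itself is an assumption, and sqlite3 stands in for the real Ingres/Oracle back end.

```python
import sqlite3

# Hypothetical miniature of an nh_stats0_* table keyed on (element_id, sample_time).
# The real eHealth tables live in Ingres/Oracle; sqlite3 is used here only to
# demonstrate the duplicate-key query itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nh_stats0 (element_id INTEGER, sample_time INTEGER, octets INTEGER)"
)
rows = [
    (101, 906357599, 500),
    (101, 906357899, 520),
    (101, 906357899, 520),   # duplicate key: same element, same sample time
    (102, 906357599, 710),
]
conn.executemany("INSERT INTO nh_stats0 VALUES (?, ?, ?)", rows)

# Rows that would make a unique (element_id, sample_time) index fail to build,
# i.e. the rows an E_US1592-style index job trips over.
dups = conn.execute(
    "SELECT element_id, sample_time, COUNT(*) AS n "
    "FROM nh_stats0 GROUP BY element_id, sample_time HAVING n > 1"
).fetchall()
print(dups)  # [(101, 906357899, 2)]
```

Once duplicate keys are located this way, a repair script can decide whether to delete the extra rows or merge them before rebuilding the index.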
ProbT0000019327: Indexing and rollups failing due to duplicate keys
ProbT0000019333: (R-5.0.1) nhiLiveExSvr dying
ProbT0000019342: AT&T and US Lec; both down
ProbT0000019349: ASCII DB save fails.
(no description): ProbT0000019365
ProbT0000019376: dialogRollup failure: duplicates on tables
(no description): ProbT0000019379
ProbT0000019394: (B5-5.0.0) (no description)
(no description): ProbT0000019436
ProbT0000019508: (knewman7) Scheduled Health reports did not import during nhLoadDb
ProbT0000019575: Statistics Index and Rollup jobs failing with an E_US1592 duplicate keys error.
ProbT0000019689: Rollups failing.
ProbT0000019693: DB load error.
ProbT0000019695: DbSave fails: Unloading table nh_active_alarm_history . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_active_alar
ProbT0000019708: JPL's database is growing at 400MB per day, and is currently 36 GB
ProbT0000019729: Data Analysis silently failing and generating core file
ProbT0000019730: (4.8 Cert D05) Database unable to recover from deadlock on iirelation table.
(no description): ProbT0000019733, ProbT0000019759
ProbT0000019762: Cannot complete Ingres patch install
(no description): ProbT0000019797, ProbT0000019798
ProbT0000019844: Console crashes with "ULH error" and "Assertion for '0' failed"
(no description): ProbT0000019846, ProbT0000019879
ProbT0000019887: Escalating per Robin Trei to keep focus on this until we can determine how widespread the problem is. Live Exceptions server stops Network Health
ProbT0000019888: The database 'nethealth' is not created successfully on install
ProbT0000019937: Receiving log lock errors on discovery
(no description): ProbT0000019972
ProbT0000020102: nhServer start fails to connect to dbms due to lock quota exceeded.
(no description): ProbT0000020115, ProbT0000020153
ProbT0000020242: (dsionne) Cannot upgrade to or install 4.8 with read-only permissions on /usr/local
(no description): ProbT0000020296, ProbT0000020298, ProbT0000020302
ProbT0000020382: (B6-5.0.0) (no description)
(no description): ProbT0000020406, ProbT0000020424, ProbT0000020428, ProbT0000020439, ProbT0000020481
ProbT0000020502: DeTeSystem, project "get it plugged" (from Rheinpfalz)
ProbT0000020505: After upgrade, file aaaaaaaa.cnf has wrong permissions
(no description): ProbT0000020509
ProbT0000020521: nhSaveDb failed with DMT_SHOW error
(no description): ProbT0000020523, ProbT0000020576, ProbT0000020579, ProbT0000020616, ProbT0000020654
ProbT0000020678: Fortis; our largest customer in Belgium. Also looking at $50K new revenue in Q2. Add Morgan Stanley with another $50K for Q1 riding on this.
(no description): ProbT0000020682
ProbT0000020695: nhFetch and database save fail with error.
ProbT0000020733: Fortis Bank; our largest customer in Belgium. Also looking at $50K new revenue in Q2.
ProbT0000020749: Conversation rollups are not terminating when job is finished.
ProbT0000020829: Receiving bus error.
(no description): ProbT0000020841, ProbT0000020869, ProbT0000020870, ProbT0000020884, ProbT0000020885, ProbT0000020901, ProbT0000021040
ProbT0000021068: nhFetch takes 4+ hours to finish.
(no description): ProbT0000021079, ProbT0000021106, ProbT0000021108, ProbT0000021109, ProbT0000021123
ProbT0000021146: nhSaveDb fails
(no description): ProbT0000021155, ProbT0000021156, ProbT0000021157, ProbT0000021159, ProbT0000021164, ProbT0000021179, ProbT0000021182, ProbT0000021208, ProbT0000021232, ProbT0000021234, ProbT0000021235, ProbT0000021260, ProbT0000021261, ProbT0000021265, ProbT0000021281, ProbT0000021330
ProbT0000021350: nhiSaveDb fails
(no description): ProbT0000021358, ProbT0000021380, ProbT0000021393, ProbT0000021438, ProbT0000021440
ProbT0000021447: Logical lock count exceeded.
(no description): ProbT0000021478
ProbT0000021527: Bank of America; revenue impact
ProbT0000021567: (B5-5.5) (no description)
ProbT0000021571: Multiple dlg table issues
(no description): ProbT0000021583, ProbT0000021584
ProbT0000021592: NTL; this case spawned from 15738 (59110/61330)
(no description): ProbT0000021601, ProbT0000021625, ProbT0000021665
ProbT0000021670: Installation of CA ARCserve backup software caused licensing failure for Ingres.
(no description): ProbT0000021674
ProbT0000021697: Statistic rollups failing with core file
(no description): ProbT0000021698, ProbT0000021706, ProbT0000021732, ProbT0000021744, ProbT0000021793, ProbT0000021795
ProbT0000021832: More than one "all" group in group lists
(no description): ProbT0000021841, ProbT0000021849, ProbT0000021861
ProbT0000021883: Server crashes when setting Live Trend to fast polling on Response Path chart definitions
(no description): ProbT0000021926, ProbT0000021938, ProbT0000021941
ProbT0000021957: nhiNtScm -start "Ingres_Database"
ProbT0000021965: CompuCom Systems: unable to connect to the database.
ProbT0000021980: Sql Error occurred during operation (E_US1194 Duplicate key on INSERT detected)
ProbT0000022017: nhiRollups fail with duplicate key error; cleanStats script doesn't help.
(no description): ProbT0000022032, ProbT0000022034, ProbT0000022056, ProbT0000022069, ProbT0000022092, ProbT0000022116, ProbT0000022146
ProbT0000022155: (5.0.2 Cert D02) Excessive extents being created.

Escalations (customer, engineer, or version noted where present):
- Error from nhiClearDb in 2MM.
- nhConvertDb failed with SQL error. (Huntington & Alltel)
- ORA-01555 snapshot too old during ASCII save.
- Error from installing ODBC driver on the Windows NT machine.
- Unable to open database due to stuck recovery of thread. (Concord Professional Services)
- Performance questions using NFS mounts.
- Upgrade from eHealth 5.5 to eHealth 5.6.5 fails quickly. (Veterans Health Administration)
- Database conversion failing during 5.5 -> 5.6.5 upgrade. (Defense Logistics Agency)
- Database conversion failure during upgrade from 5.5 to 5.6.5. (AMT Credit Agricole)
- Statistics Index failing on French installation.
- Data gaps occurring in data after nhLoadDb from one server to another server. (T. Rowe Price)
- eHealth 5.6.5 two-machine migration to Solaris 2.8 failing. (farroyo; R-5.6.1)
- All reports fail after installing new license. (BT-Ignite (Leeds))
- nhCreateDb is failing on six servers. (US Secret Service)
- When starting the eHealth server, nhiDbServer crashes with a core file.
- eHealth upgrade from 5.5 to 5.6.5 failed.
- Unable to start nhServer. (ADP Dealer Services)
- nhConvertDb appears to be hung.
- Poller crashes once every several days with SQL error.
- eHealth 5.6.5 install is failing on a sqlplus query that falls in the verifyOraInst. (General Communications Inc. (GCI))
- ORA-1172 signalled during: ALTER DATABASE OPEN.
- Statistics rollups failing with unique constraint error. (ahuang; 5.6I18N; MyNetLab)
- Can't generate Top-N report on both Web and GUI Console associated with Bandwidth Utilization. (National Electricity Market Management Company Limited)
- Database files running out of disk space; sizing concerns.
- Stats rollup failed with error E_LQ0058 Cursor 'nh_stats0_1063022399QC' not open for 'close' command. (Deloitte & Touche)
- nhSaveDb errors out. (R5.6.5P1; University of Cincinnati)
- Duplicate key error message when running Trend report. (Veloz Global Solutions)
- nhExportData fails with error.
- Data analysis failed due to non-existing index. (Yipes Enterprise Services Inc.)
- ORA-04031 error when running query.
- Scheduled database save failed every other day on eHealth 5.6.5 patch 1.
- Some of customer's Health Reports are failing with this error: Fatal Error: Assertion for 'cdbSampleLoopCnt++ == 2' failed. (jkaufman; R-5.5; R-5.0.2)
- How does the customer know that nhLoadPolledData failed? (Union Bank of Switzerland AG)
- Unable to create Oracle database during the migration process. (State of Kansas Dept. of Administration-DISC)
- Migration from 5.0.2 to 5.6.5 on Solaris 8 core dumped. (BT Ignite - West Malling)
- Compressing ASCII backups. (Fraport AG)
- "Could not create the Java virtual machine"; Oracle install fails. (Kabel Berlin Brandenburg GmbH & Co KG)
- ORA-04031: unable to allocate 4212 bytes of shared memory. (PCCW (IDRS); MoreInfo)
- Duplicate keys in NH_GROUP_MEMBERS_IX1 from load.log file. (ylei; R-5.0.2P4; 5.0.2 Cert D05)
- Stats rollup failure with overflow error. (Great West Life Assurance Company)
- ASCII save/load causes inconsistency with nh_rlp_boundary table. (5.6.5)
- nhManageDbSpace fails to move large datafile when the file is > 2G. (Nord/LB)
- ORA-04031: DbApi cannot handle raw data. (Automobile Club of Southern California; Field Test)
- Error says "Unable to connect to the database."
- Request for Oracle 9.2.0.4 patch set. (Banamex)
- Migration failing with error "Fatal database error: Step 2 in sa_rev_r55_15". (Sparkassen Informatik Services West)
- Db API cannot create views. (Hartford Inc.; Escalated: Yes)
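Two of the entries above hit ORA-04031 ("unable to allocate 4212 bytes of shared memory"), and the root-cause notes that follow blame shared pool fragmentation. The point worth remembering is that ORA-04031 does not necessarily mean the pool is full: a fragmented pool can hold plenty of total free memory yet lack a single contiguous chunk big enough for one request. A toy free-list model (all chunk sizes invented) illustrates the distinction:

```python
# Toy model of a fragmented shared-pool free list.
# Chunk sizes are invented for illustration; only the 4212-byte request
# comes from the ORA-04031 message quoted in the log.
free_chunks = [4000, 3800, 2500, 4000, 3900, 1200]

request = 4212  # bytes the failing allocation asked for

total_free = sum(free_chunks)   # plenty of free memory in total...
largest = max(free_chunks)      # ...but the largest contiguous chunk is smaller
can_allocate = largest >= request

print(total_free, largest, can_allocate)  # 19400 4000 False
```

This is why the fix recorded in the header notes (a larger shared pool in 5.6.5 P2, plus adjusted cursor and shared memory parameters) helps: a bigger pool fragments more slowly, so large contiguous chunks survive longer.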
ORACLE PARAMS
- New system migration failed on nhSaveMigrate. (Deltacom)
- Rollup failed with duplicate error. (Inovant)
- US Bank sizing issue. (US Bank)

Column headers: NUMBER / TYPE / DB / DESCRIPTION / ROOT CAUSE

- BUG / ORACLE: StopDb didn't detect open Oracle connections.
- CODE: Prevent Live Poller crash due to invalid ROWID request.
- CODE: Add indexes to speed up slow conversions.
- NhExportData error.
- Saves sometimes fail; introduced in 5.6.5 P1.
- Ascii save/load does not correct an inconsistency in nh_rlp_boundary table.
- NhiOraPatch would hang accessing certain mount points (Sulfer).
- Oracle 8i to 9i upgrade code will reduce the system tablespace to 200MB if a customer has already increased it beyond 200.
- Temp files not cleaned up.
- BUG - REPEAT: Introduced in Viper P1; Ingres files were not copied properly and were not available for conversion.
- SUPPORT SCRIPT ERROR / INGRES: A custom script, TaDataPurge, needed enhancement to not delete unknown MTFs.
- Bad data caused indexes to not be created.
- RP deletion on the central machine could fail because it couldn't put DUPs in the deleted elements table.
- Load DB error introduced in 5.6.5 P2 due to array out of bounds.
- NON EH DATA: Customer created a special table in their DB for CERT; the table name exceeded 30 characters, which caused the upgrade to 5.6.5 to fail on DB conversion.
- DATA / BAD DATA: STATS data had a negative delta time value that caused rollup failures. Unclear how the data got there.
- Invalid time value loaded into Oracle from Ingres as part of backfill; how did it get there? Will it happen again?
- Verizon noticed an invalid object in the database. The object NH_CUST_LOCK was created by nhGetCustDb. It was not being deleted at the end of nhGetCustDb.
- Customer installed Oracle Enterprise Manager on the eHealth DB (illegal), which created a table containing an unsupported datatype that convertDb could not handle on upgrade to 5.6.5.
- Non-eHealth table TEMP_STAT_MIN_MAX was in the DB, with an unsupported datatype that convertDb could not handle on upgrade to 5.6.5.
- DUPS: French customer had dups; wrong patch level to send them the MANAGE_STATS_DUPS function to obtain info about how dups are getting there. Manually fixed up their DB.
- 5.6.5: Rollups failed due to dups. Poller team found the data to be identical to some poller dups that were fixed in 5.0.2 P7.
- Reports fail due to duplicates in the DB; proposal to add two debug columns of data for the poller team to track down the cause of dups.
- Duplicate keys in group members table; removed, but the source was not pursued. (Ingres)
- Rollups failed; dups. Root cause researched in 38099 by Santosh.
- Unique constraint failure in Poller. Pursuing with "stats forensics".
- Duplicates in stats tables. Poller team thinks the problem might not be the poller; it could be in rollups. Robin sent stats forensics to try to pinpoint the cause.
- SET WRONG ENVIRONMENT: Japanese system; customer sourced the Ingres environment and ran the Oracle createDb. Causes an error message, but DB creation was successful. Duplicate of 35712.
- ENVIRONMENT: Customer made a mistake in 2MM and tried to recover. CreateDb failed, but it turned out there were non-DB files missing in the install tree (e.g. messageText.sys). Customer cleaned the system and re-ran the install successfully.
- OS PATCH: Customer DB file got corrupted. Traced the problem to a missing Solaris patch, and a Sun doc on how it causes this issue. Customer restored from tape and applied the OS patch.
- USER ENVIRONMENT KB: Install failed with SQL error. The problem was that Oracle did not link correctly because the customer's LPATH environment variable was set, overriding the linker default search path. KB article created.
- NO DUPL / AFTER UPGRADE: Starting server causes crash in 5.5. Problem was permissions issues on log files.
- Remote poller hang during merge. Closed as fixed in 5.6.5.
- INGRES REMOTE POLLER: Remote poller hang during merge; has now been in MoreInfo for two weeks.
- Dups in DB; got info about the dup elements and had Support create a poller team ticket.
- INGRES DUPS: A particular Ingres patch can crash Ingres at startup. We wrote a KB solution to address it.
- NO REPLY: Trend report fails due to dups in DB. Requested advanced logging from the customer; no reply for a month; ticket closed.
- Rollups failed with cursor error. Customer was keeping 70 weeks of raw data. Logs showed they had had problems for 6 months before the call. Gave the customer instructions to try to recover data. No reply.
- Bad data in some stats tables. Large negative values exist in AR element data. Same incident as PT 37721. Met with the AR team to determine what is causing bad data. Determined that the bad data gets populated by SysEdge. Tried to pursue getting SysEdge element info from the customer. No reply for one month. Ticket closed.
- ASCII save fails with "Snapshot too old" error. Rollback segment was filling up. The culprit is probably AR, which creates many small transactions. Disabling AR allowed the save to complete. Oracle admitted that this is a bug, ID 3158889 (for Oracle 8.1.7; this is eH 5.5).
- Upgrade from 5.5 to 5.6.5 failed. The user did not have the Java SDK installed. Eventually got other errors, and PT 36977 was used to track them.
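Several of the root causes above come down to bad rows: negative delta times and invalid timestamps that later break rollups, with the source (SysEdge, backfill) only discovered months afterward. A cheap defensive check at load time can catch such rows before they poison a rollup. The field names and bounds below are illustrative, not the real eHealth schema:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    element_id: int
    sample_time: int   # epoch seconds
    delta_time: int    # seconds since the previous sample

def bad_rows(samples):
    """Flag rows a rollup would choke on: negative delta times, or
    timestamps outside a sane window (bounds are arbitrary sanity limits)."""
    MIN_TS, MAX_TS = 631152000, 4102444800   # roughly 1990..2100
    return [s for s in samples
            if s.delta_time < 0 or not (MIN_TS <= s.sample_time <= MAX_TS)]

samples = [
    Sample(101, 906357599, 300),
    Sample(101, 906357899, -300),  # negative delta, like the STATS bad data above
    Sample(102, 100, 300),         # absurd timestamp, like the backfill case
]
print(len(bad_rows(samples)))  # 2
```

Rejecting or quarantining such rows at ingest would turn a silent rollup failure weeks later into an immediate, attributable error.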
NOBUG: System tablespace cannot extend itself.

ProbT0000029748: Stats rollup failed with "create stats1 table" error
ProbT0000029816: Oracle database will not open after P3a installation
(no description): ProbT0000029844
ProbT0000029866: NH_INDE01.dbf file is too large to be handled by Oracle
ProbT0000029873: (R-4.8.0P13) When a fetchDb fails and is run the next day, there is no analyzed data for that day.
ProbT0000029915: Running out of disk and could not access instance
(no description): ProbT0000029924, ProbT0000029954, ProbT0000030001, ProbT0000030092, ProbT0000030159
ProbT0000030206: eHealth report performance problem.
(no description): ProbT0000030233, ProbT0000030315, ProbT0000030347
ProbT0000030441: Database save failed with memory fault error
ProbT0000030475: SQL error from nhiIndexStats
ProbT0000030476: Duplicate key on nh_node table
(no description): ProbT0000030477
ProbT0000030497: Stats index job failing.
(no description): ProbT0000030591
ProbT0000030604: Statistic indexing failure due to duplicate keys on 5.0.2 P05a
ProbT0000030609: (khamner; 4.8 Cert D06) Ranges 3050 and 3350 do not work with custom variables
ProbT0000030628: Merge 27606 and 28824 to 5.6 B4
(no description): ProbT0000030646
ProbT0000030771: Reports fail with "ORA-01476: divisor is equal to zero." error.
ProbT0000030882: Primary key SQL errors and _elemCache assertion failures by nhiPoller proc
ProbT0000031056: (botoole; R-4.8.0P8 D07) Ingres errors preventing NetHealth start on beta test system
ProbT0000031107: nhDbStatus reports an incorrect number of elements
(no description): ProbT0000031113, ProbT0000031136
ProbT0000031156: Traffic Accountant message that the transaction log is full.
ProbT0000031248: Issue with database load not checking for necessary datafiles
ProbT0000031273: Dup Stats error on Oracle
ProbT0000031357: (shonaryar) nhDbStatus reports an incorrect number of elements
ProbT0000031387: Conversation rollups failing with core file
(no description): ProbT0000031392
ProbT0000031405: Customer wants to install Oracle patches.
(no description): ProbT0000031477
ProbT0000031499: "Unable to use Memory Windows" message in the $ORACLE_HOME/rdbms/log/alert_EHEALTH.log file.
ProbT0000031504: Roll-up data seems to be incorrect.
(no description): ProbT0000031605, ProbT0000031620
ProbT0000031707: Health Report fails with error.
(no description): ProbT0000031721, ProbT0000031804
ProbT0000031810: Received core dump from nhiDbServer
ProbT0000031892: The customer claims the conversation rollups keep their disk from filling up, but they do not poll conversation data.
ProbT0000031898: Patch: Customer would like Concord's response to Oracle Security Alert #54
ProbT0000031900: Patch: Customer would like Concord's response to Oracle Security Alert 48
ProbT0000031901: Patch: Customer would like Concord's response to Oracle Security Alert 49
ProbT0000031902: Patch: Customer would like Concord's response to Oracle Alert 50
ProbT0000031905: Patch: Customer would like Concord's response to Oracle Alert 51
ProbT0000031906: Patch: Customer would like Concord's response to Oracle Alert 20
ProbT0000031907: Patch: Customer would like Concord's response to Oracle Alert 29
ProbT0000031915: Patch: Customer would like Concord's response to Oracle ALTER SESSION privilege can dump trace files with possibly sensitive dat
(no description): ProbT0000031969, ProbT0000031991
ProbT0000032378: (eHealth 5.5) Need to certify Oracle patches for use with eHealth on NT. NO_FIX
(no description): ProbT0000032404, ProbT0000032410, ProbT0000032480, ProbT0000032486
ProbT0000032505: (B4-5.6) Group and group lists did not migrate from 5.0.2 to 5.6.0 B4.
ProbT0000032601: (rtrei; 5A-5.6.1) Don't index stats0 right away
(no description): ProbT0000032605, ProbT0000032668, ProbT0000032783
ProbT0000032796: Beta 4 to beta 5 upgrade fails getting Oracle password. INSUFF_TESTING
(no description): ProbT0000032804, ProbT0000032845, ProbT0000032849, ProbT0000032865, ProbT0000032868
ProbT0000032875: Verizon Wireless-Bedminster
(no description): ProbT0000032904, ProbT0000032917, ProbT0000032938, ProbT0000032943, ProbT0000032948, ProbT0000032979, ProbT0000033048, ProbT0000033060, ProbT0000033063
ProbT0000033098: eHealth 5.6 migration takes too long due to NH_MANAGE_STAT_DUPS forensics
(no description): ProbT0000033130, ProbT0000033134, ProbT0000033141
ProbT0000033172: Live Ex baseline failure
(no description): ProbT0000033181, ProbT0000033194, ProbT0000033228
ProbT0000033264: (R-5.6.1; R-5.6) Oracle installation failed on 3rd CD
ProbT0000033283: nhForceDb does not default to NH_RDBMS_NAME
(no description): ProbT0000033365, ProbT0000033392, ProbT0000033436, ProbT0000033454, ProbT0000033462, ProbT0000033507, ProbT0000033540
ProbT0000033571: nhPurgeDeleted fails with no indication of cause
(no description): ProbT0000033667, ProbT0000033696, ProbT0000033705, ProbT0000033746, ProbT0000033763, ProbT0000033922, ProbT0000033982, ProbT0000034020, ProbT0000034031, ProbT0000034124, ProbT0000034169, ProbT0000034256, ProbT0000034288
ProbT0000034318: nhDbStatus from the command line does not work on Windows 2000
(no description): ProbT0000034347, ProbT0000034364, ProbT0000034400, ProbT0000034508, ProbT0000034529
ProbT0000034571: (- none -) Back-port nhCollectDbData for MoreInfo tool
(no description): ProbT0000034573
ProbT0000034643: (pmorgan; 5.6.5B2) moreinfo on BE fails for database.collect
(no description): ProbT0000034652, ProbT0000034682, ProbT0000034734, ProbT0000034769
ProbT0000034772: (llopilato; 5.6.5M2) nhsReportDups.sh does not work in the Windows environment.
ProbT0000034826: (bmiranda) NH_HOME_RDBMS_INGRES is not set when trying to run nhsDiffCnfig -1mm.
(no description): ProbT0000034847, ProbT0000034871, ProbT0000034876, ProbT0000034964, ProbT0000034971, ProbT0000034988, ProbT0000035001, ProbT0000035009, ProbT0000035011, ProbT0000035050, ProbT0000035063, ProbT0000035115, ProbT0000035149, ProbT0000035151, ProbT0000035168, ProbT0000035177, ProbT0000035178, ProbT0000035183, ProbT0000035201, ProbT0000035325, ProbT0000035347
ProbT0000035421: (dandrews) PATCH BLOCKER - Compilation error in CdbTblElemStats.C
(no description): ProbT0000035423, ProbT0000035442, ProbT0000035465, ProbT0000035520, ProbT0000035521, ProbT0000035522, ProbT0000035538, ProbT0000035555, ProbT0000035595, ProbT0000035610, ProbT0000035632, ProbT0000035712, ProbT0000035736, ProbT0000035780, ProbT0000035823, ProbT0000035827, ProbT0000035842, ProbT0000035845, ProbT0000035848, ProbT0000035869, ProbT0000035871, ProbT0000035895, ProbT0000035911, ProbT0000035940, ProbT0000035952, ProbT0000035972, ProbT0000035977, ProbT0000035987, ProbT0000035993, ProbT0000036010, ProbT0000036018, ProbT0000036034, ProbT0000036050, ProbT0000036088, ProbT0000036116, ProbT0000036153
ProbT0000036177: (5.6.5I18N B1; 5.6.5B1) Error message during the eHealth 5.6.5 Spanish Beta 1 installation
(no description): ProbT0000036191, ProbT0000036214, ProbT0000036226, ProbT0000036274, ProbT0000036318, ProbT0000036320, ProbT0000036341
ProbT0000036342: (5.6.5I18N B1) nhCreateDb command outputs the error to a log file
(no description): ProbT0000036357, ProbT0000036382, ProbT0000036387, ProbT0000036389, ProbT0000036403, ProbT0000036438, ProbT0000036555, ProbT0000036692, ProbT0000036700, ProbT0000036702, ProbT0000036707, ProbT0000036737, ProbT0000036793, ProbT0000036815, ProbT0000036826, ProbT0000036910, ProbT0000036935, ProbT0000036944, ProbT0000036977, ProbT0000037077, ProbT0000037088, ProbT0000037108, ProbT0000037119, ProbT0000037129, ProbT0000037154, ProbT0000037159, ProbT0000037181, ProbT0000037243, ProbT0000037259, ProbT0000037472, ProbT0000037512, ProbT0000037543, ProbT0000037564, ProbT0000037567, ProbT0000037578, ProbT0000037587, ProbT0000037679, ProbT0000037697, ProbT0000037721, ProbT0000037727, ProbT0000037771, ProbT0000037798, ProbT0000037828, ProbT0000037830
ProbT0000037833: (gjones) eHealth 5.6.6 beta 1 and the oramod plugin are incompatible
(no description): ProbT0000037856, ProbT0000037883, ProbT0000037905, ProbT0000037948, ProbT0000037963, ProbT0000038027, ProbT0000038030, ProbT0000038032, ProbT0000038035, ProbT0000038089, ProbT0000038139, ProbT0000038193, ProbT0000038217, ProbT0000038259, ProbT0000038333, ProbT0000038355, ProbT0000038359, ProbT0000038360, ProbT0000038378, ProbT0000038457, ProbT0000038509, ProbT0000038538, ProbT0000038545, ProbT0000038549, ProbT0000038563, ProbT0000038583, ProbT0000038594, ProbT0000038598, ProbT0000038599, ProbT0000038627, ProbT0000038648, ProbT0000038672, ProbT0000038688, ProbT0000038690, ProbT0000038695, ProbT0000038745, ProbT0000038747, ProbT0000038763, ProbT0000038782, ProbT0000038801, ProbT0000038813, ProbT0000038832, ProbT0000038840, ProbT0000038868, ProbT0000038887, ProbT0000038901, ProbT0000038902, ProbT0000038910, ProbT0000038932, ProbT0000038949, ProbT0000039008, ProbT0000039115, ProbT0000039175, ProbT0000039229, ProbT0000039364, ProbT0000039372, ProbT0000039380, ProbT0000039399, ProbT0000039407, ProbT0000039409
ProbT0000039542: (rsanginario; R5.6.5P3) nhLoadDb fails on a clean 565 P3/D3 system, complaining about being unable to ConvertDb
(no description): ProbT0000039579, ProbT0000039584, ProbT0000039614, ProbT0000039638, ProbT0000039646, ProbT0000039722, ProbT0000039723, ProbT0000039757, ProbT0000039932, ProbT0000039936, ProbT0000039941, ProbT0000040059, ProbT0000040076

Column headers: ver(tmp) / DaysOpen / RawVer / Escalated / Detailed Description

- Conversation and Statistics rollups are failing with SQL 1592 and SQL 1591 errors.
ProbT0000007474: Database rollups are failing with: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
ProbT0000007498: Statistics_Index and Statistics_Rollups fail with SQL1592 errors.
ProbT0000007506: Customer getting "Error: Append to table nh_stats1_953787599 failed..." after manual rollup.
ProbT0000007541: Statistics_Rollups fail with error: E_US125C Deadlock detected.
ProbT00000075809Roll ups failed due to" table nh_dlg1s_953701199 failed."ProbT00000075916Statistics_Rollups failing with Sql error E_US1592 ProbT0000007629dpatel2Statistics Rollup failing with duplicate key errorProbT0000007671YCreating the system log as a file, then parsing it off and saving if for the last 7 days.ProbT0000007672(Make the rollups clean up after failure.ProbT0000007678xnhiLoadDb fails with message: E_CO0039 Error processing row 1. Cannot convert column 'poll_rate' to tuple format. ProbT00000077333Statistics Rollups failing with duplicate key errorProbT0000007768'Duplicate Key in Database during RollupProbT0000007802Statistics Rollup failureProbT0000007839UData analysis failurure/DAC cannot scanle to amount of elements we say we can supportProbT0000007879Rollups failingProbT0000007937DAbility for user to specify checkpoint save location via the ConsoleProbT0000007962-rollups fail with sql error, duplicate keys ProbT00000079672Short - rollups fail with append to table error ProbT00000079770Statistics rollups failing due to duplicate keysProbT0000007983QGot "database failed to convert" after trying to install P12 on my 4.5P10 system.ProbT0000008005Statistics rollups are failing.ProbT0000008016%rollups failing with duplicate keys ProbT0000008036vConversation Rollup failing with "E_US1591 MODIFY: table could not be modified because rows contain duplicate keys." ProbT00000080470Statistics rollups failig due to duplicate keys.ProbT00000080505Tables could not be indexed because of duplicate keysProbT0000008100/capability to merge a saved db into current db.ProbT0000008162:Conversation rollups failing with append to table error. ProbT0000008173*Append to table nh_dlg1b_950936399 failed,ProbT0000008179Database inconsistancyProbT0000008184dLOad failing with error: "Error: Uncompress of file /dbsave/MWF.tdb/nh_stats0_956645999 failed. 
..."ProbT0000008249Inconsistant DBProbT0000008302nhFetchDb failedProbT0000008310Statistics-rollup failingProbT00000083707Statistic Rollup and Index failures and DMT show errorsProbT00000083856Statistics Rollups failing with append to table error.ProbT0000008386@Roll ups are failing due to duplicate table nh_stats2_908665199'ProbT0000008403CStatistics_Index fails with Sql error, rows contain duplicate keys.ProbT0000008421FStats rollup and stats index failing with sql error, duplicate keys. 5/5/2000 9:04:23 PM SysAdmin nhDbStatus does not show correct amount of free space. Here is the infor provided by the reseller. ICS support@ics.de HOST NAME: BRONX HOST ID: IP ADDRESS: x.x.x.x VENDOR: SUN OPERATING SYSTEM: Solaris 2.5.1 HARDWARE MODEL: Enterprise 450 (3 CPU) WINDOWING SYSTEM: CDE SWAP SPACE: 100 MB MEMORY: 128 MB LICENSE DATA: ============= SOFTWARE STATE: Installation VERSION: 4.0.1e EXPIRATION DATE: never POLLER CODE: POLLER CHECKSUM: LAN CODE: LAN CHECKSUM: WAN CODE: WAN CHECKSUM: NUMBER OF ELEMENTS: TICKET-DATA: ============ TYPE: Problem PRIORITY: 4 STATUS: New LONG DESCRIPTION: The Free Disk Space of the partition containing the ingres db is not always reported correctly by nhDbStatus. This customer has a 26GB /DATA partition containing the nethealth db. There are about 18GB free on this partition. nhDBStatus shows only about 4GB. 
See also output of df -k and nhDbStatus below: df -k : Filesystem kbytes used avail capacity Mounted on /dev/dsk/c0t0d0s0 96391 41365 54930 43% / /dev/dsk/c0t0d0s6 1015679 409752 604235 41% /usr /proc 0 0 0 0% /proc fd 0 0 0 0% /dev/fd /dev/dsk/c0t0d0s4 336871 67920 268615 21% /var /dev/md/dsk/d10 8245317 170168 7992696 3% /export/home /dev/dsk/c0t0d0s5 336871 12619 323916 4% /opt /dev/md/dsk/d20 26081505 7415728 18404< 962 29% /DATA swap 2420984 2992 2417992 1% /tmp nhDbStatus: Database Name: nethealth Database Size: 2488092000 bytes Free Disk Space: 4194303999 bytes Database Location: /DATA/nethealth/idb RDBMS Version: OI 1.2/01 (su4.us5/00) Statistics Data: Number of Elements: 3123 Database Size: 2304164000 bytes Last Roll up: 07/15/98 00:22:39 Latest Entry: 07/21/98 10:53:53 Earliest Entry: 01/15/98 00:00:00 Conversations Data: Number of Probes: 45 Number of Nodes: 9036 Last Roll up: 07/15/98 00:22:23 As Polled: Database Size: 131496000 bytes Latest Entry: 07/21/98 09:55:39 Earliest Entry: 07/08/98 00:00:00 Rolled up Conversations: Database Size: 26906000 bytes Latest Entry: 07/07/98 23:59:59 Earliest Entry: 03/01/98 00:13:35 Rolled up Top Conversations: Database Size: 13710000 bytes Latest Entry: 07/07/98 23:59:59 Earliest Entry: 01/15/98 16:34:11 ICS Support Contact : Support Center Tuesday, July 21, 1998 1:55:34 PM lgraves Dear Concord Reseller, This is to acknowledge receipt of your e-mail message. We have assigned your message the following call ticket number 14738. A Support Representative will be responding to you directly by e-mail or phone. If you wish to inquire about the status of your message, please refer to the call ticket number above. CallT0000016033 Jaime Sanchez of LSI Logic, Inc. at 408-433-7892 The third question is in regards to error messages generated on the console window: "Tuesday, August 18, 1998 03:23:03 PM Warning (Statistical Poller) The database space is getting low, you should make more space available". 
This error message is repeated every 4-5 minutes. I've checked the partition size (only 25 % full) and we have plenty of space. When checking the database status we show the database size as 356,156 K with 180,832 K free disk space and 2,090 elements discovered. Any ideas ? Does the DB size and free space look okay to you? Let me know what other information you require. Those are all the questions I have at the moment. Again, if my questions require a significant time to answer or resolve, please feel free to open a new ticket or let me know and I will generate a new email or call tech support to open a new ticket. Thanks for your help and I'll talk to you tomorrow morning. Thanks, Jaime 5/5/2000 9:04:25 PM SysAdmin If the Network Health systems stops (for instance with database inconsistencies) and can't be broght back to life again, it's not possible to directly access the system log. The system log therefore should be written to a file (probably in addition to writing it to the database). Yes, it's a possibility to do a nhForceDb, but that doesn't always work and the customer might decide not to do the force, but reinstall from a backup, if he had access to the system log in the first place. Also, if system log was a file, one could use logfile monitors (for raising alarms, etc.). n5/5/2000 9:04:26 PM SysAdmin Wednesday, September 02, 1998 12:22:01 PM dinsmoc Warning statistcs poller the database space is getting low. They have 15 gigabytes of space but the database status reports they only have 115 mb on a 1915 element database Wednesday, September 02, 1998 5:07:32 PM dinsmoc Spoke with Jaime and gave him the disable poll spacecheck variable Thursday, September 03, 1998 9:49:38 AM dinsmoc This problem actually casued network health to stop polling and until I gave him the disable spacecheck variable it was giving him the poller error message. 
Spoke with Amy; she believes that this is a new but related bug to the original bug we have with reporting space over 4 GB, but in that bug the poller never gets a warning message; only the dbstatus is affected. I am escalating the call; Amy has requested that another bug be filed. --- Bug #4258 is the old version of this problem. It is slated to be fixed in 4.5. 9/21/98 Amy

This defect has been seen in ticket #23277 also. The customer is concerned that Ingres will think it has run out of disk space when it really hasn't.

5/5/2000 9:04:27 PM SysAdmin Similar to the manner in which the Web module uses the $(DATE) prefix, it would be helpful if the D/B save functionality (either scheduled or through the GUI) provided this same mechanism. This would avoid having customers overwrite their D/Bs on a nightly basis, as the name would always be the same in the job scheduler and might be the same if invoked from the GUI. This enhancement was requested by Irridium (Call ticket #16046).

5/5/2000 9:04:28 PM SysAdmin Narrative: Customer is running in a distributed polling environment across multiple time zones. The rollups started failing shortly after the upgrade to 4.1. The initial 4.0.1 system had a custom TZ setting for the remote pollers to keep them all in the same time zone. When the customer upgraded, the TZ settings were lost, and the customer's stats0 tables grew to exceed the tx log file size allotment during rollup time. Error from the Rollups log file:

    Error: Unable to execute 'DELETE FROM nh_stats0_906440399 WHERE sample_time <= 906357599' (E_US1264 The query has been aborted.

errlog.log snip:

    ::[56412 , 00ADA040]: Tue Sep 29 12:04:55 1998 E_QE0022_QUERY_ABORTED The query has been aborted.
    ::[56412 , 00ADA040]: Tue Sep 29 12:04:55 1998 E_DM9059_TRAN_FORCE_ABORT The transaction (00003575, 357DFFDC) in database nethealth is about to be force aborted. Further messages may or may not follow describing the force abort in more detail.
    ::[56412 , 00ADA040]: Tue Sep 29 16:11:58 1998 W_DM5422_IIDBDB_NOT_JOURNALED WARNING: The iidbdb is being opened but journaling is not enabled; this is not recommended.
    ::[56412 , 00AB4880]: Tue Sep 29 17:05:16 1998 W_DM5422_IIDBDB_NOT_JOURNALED WARNING: The iidbdb is being opened but journaling is not enabled; this is not recommended.
    ::[56412

Total number of elements = 5100. 7 days raw, 6 weeks hourly, 26 weeks daily. The rollups started failing on the 23rd or the 24th. 4.1 was applied in the first week of September. ~4 remote pollers. The Ingres tx log file was at 512 MB and rollups failed with a force abort; currently trying rollups with the Ingres tx log at 1 GB. NH 4.1, Solaris 2.5.1, Ultra-Sparc-2 (5 CPUs), 512 MB RAM, 1 GB+ swap. MCI, John Klien, 719.535.3666

5/5/2000 9:04:30 PM SysAdmin Customers were keeping as-polled data for 1 week or more, and everything was fine prior to the time change. Reports were showing data for 5-minute intervals. Now, all data prior to the daylight savings time change is rolled up into one-hour samples, even though the as-polled data is being kept for longer than the time period since the daylight savings time change. Reports are available from snorman if needed. Also confirmed by jwolf. We also need to identify if there are any detrimental effects to the system ASAP. mjc--10/26/98

8/11/2000 9:25:14 AM jay Changed from "fixed" to "closed" due to new procedures. 8/22/2000 9:53:30 AM bhinkel Changed status to Closed, as the fix was included in a product release.

5/5/2000 9:04:31 PM SysAdmin He would like this option so that he doesn't have to worry about the scheduler going off when he loads the database.

5/5/2000 9:04:33 PM SysAdmin Customer John Rawson of British Telecom: the Database Status window can be totally misleading, since the date in "Last Rollup" does not mean the last "successful" roll-up, just the last time it tried to run; you have to look at the Statistics_Rollup.100000.log to see if it actually worked.
(This problem was why I left the sorting out of the dB for so long; the status seemed to be saying it was OK.) --- This was approved, declined, and updated per Steve McAfee. 3/26/99 Amy

3/29/99 Steve McAfee Reopened after receiving more information from Pete Allen: Hi Steve, Happy to help in any way. Doesn't the ProbT0000004727 ticket number indicate that it -is- in Remedy? Sorry, I'm a field guy, so I'm not fluent in the details of how issues are tracked. The scenario that leads to data loss is thus: ACME Telco runs several headless (i.e., no monitors) Sun Ultra Network Health workstations in the 30-some-odd Sun workstation centre in England. Network Health is one of -many- applications that this telco operates on many different machines running many different OS's. Strangely, for some reason, the roll-ups begin to fail at some point. We don't know exactly when, because NH doesn't let you know when it's in trouble, but this user experienced severe database corruption due to distributed poller "irregularities". When we checked the db status to find out when the last rollups occurred, we were conveniently informed that they completed happily the day before. Somehow we were expected to realize that this didn't actually mean that the rollups did occur, but merely that the process had started (my, that's helpful). In reality, the telco darn near lost their entire database, the entirety of which was PAID CUSTOMER data. They almost lost their entire database, of course, because the rollups were NOT occurring, and the database was in the slow process of performing a core meltdown. These people take a rather dim view of database "glitches". This is extremely critical now. We have Production (with a capital P) sites providing reports-for-money to customers. Downtime is -not- an option for these people. They look at our UIs and they expect the words to mean what they say.
If the UI says a rollup occurred on a certain date, then that actually should mean that the rollup "occurred" on that date. This is real-world, customer-centric stuff. I don't mean to appear testy, but I would like you to offer the explanation for the above scenario so that I can learn how to explain to a mortal being why words don't mean what they say. With respect, Pete

======================================================
Pete Allen
International Technical Manager
Concord Communications, Inc.
Voice : (508) 303-4301
GSM : (978) 902-2939
Fax : (508) 481-9772
Email : pete@concord.com
SMS : 19789022939@omnipoint.net

--- After further discussion, a middle tier of this function will be put into NH 4.5. The last time the rollup succeeded will be unknown, but the Db Status output (UI or command line) will show FAILED instead of the date if the last attempt was not successful. 3/30/99 Amy

5/5/2000 9:04:34 PM SysAdmin The customer wanted to know if he could have the ability to save his database by groups, so that he could send sections of the database to other users that do not need access to the whole database.

5/5/2000 9:04:36 PM SysAdmin Excerpt from customer ticket: One of our customers was interested in having an "nhRemoteSaveDb" script that used secure copy (i.e., scp) instead of normal FTP for the transmission, so I had a closer look at the script the other day. I found the following excerpt in the code:

    > GET_ING_ERROR="cat ${ING_ERROR_FILE} | sed -e '1d' | sed -e "/row/d" |
    > sed -e "/Executing/d" | sed -e '/... ... [ 0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] 19[0-9][0-9]/d'"

in the initVars() function. Matching for 19[0-9][0-9] doesn't look Y2K compliant to me!?

5/5/2000 9:04:37 PM SysAdmin We have this environment set up in the 3rd floor training room. It will only be set up until 5:00pm Friday March 5th, so if you want to take advantage of this please come down. I have an NT machine set to the same time and time zone.
(NT: GMT+2 (South African time) and the Solaris TZ=GMT-2). The database was saved in ASCII format on the NT machine and then FTP'd to the Solaris machine and loaded. A report for the same day, time, and element showed a difference in the time. The data appeared to shift backward 4 hours. We have duplicated this in our training room here at Concord; however, this is only going to be available for a limited time (until Friday 3/5/99 @ 5pm).

4958 is the bug where a user has been polling on NT in South Africa and they try to move the database to Solaris. When they move to Solaris and run the same trend reports over, the data shifts 10 hours. It looks like on NT, even though you have selected Pretoria, South Africa on the control panel, the system calls are returning timestamps relative to Pacific Standard Time (Redmond, Washington). So the data is shifted in the database. I have confirmed this by changing the Control Panel between South Africa and Pacific Standard Time and rerunning the same trend report for the same day, and it shows up identical. On 4/20 we asked Stephanie for the customer's database so we can test the repair script. -Vince

Either this is a Nutcracker problem or a Microsoft problem. If it is the former, we have a hope of them fixing it. If it is the latter, I don't hold out much hope. 4/1/99 - It appears that the bug was Microsoft's, but Nutcracker has worked around it in 4.1.5. The problem is that if the client upgrades to 4.1.5, all baseline information will shift. Investigating conversion of all timestamps in the database.

> On 4/20 we asked Stephanie for the customers database
> so we can test the repair script. -Vince

7/23: Mike C. and Jay W. are working together to see if they can modify a script to make the current solution work in a more foolproof manner. The main problem is that South Africa does not observe DST. However, the Network Health product has been doing this at the installations in this country.
So, some of their data is off by 1 hour. De-escalated by mjc. Customer has script. Feedback is minimal. dgray to keep working with customer/reseller.

5/5/2000 9:04:38 PM SysAdmin When a database was transferred between two machines, not all of the groups came over in the database load. The router group did, but the LAN/WAN group did not. This was confirmed here in Tech Support, where no group information was transferred. The workaround was to copy $NH_HOME/reports/lanwan to the new machine. The following information is from the call ticket: Here are the issues I found using the SaveDB/LoadDB process.
1. The router groups got copied over and the lan/wan groups did not. The lan/wan groups were there in the tdb directory structure but were not copied.
2. The router and lan/wan groups that were brought over were all lower case, where the originals were mixed upper and lower case. This caused the scheduled reports to not run.
3. The report types were converted to lower case, causing the scheduled reports to not run.

5/5/2000 9:04:39 PM SysAdmin Customer requests that Ingres database conversions be separate from the upgrade to later revisions of Network Health. The customer believes that if the conversion were separate from the upgrade, it would be possible to skip revisions during upgrades. 9/1/2001 4:19:04 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:40 PM SysAdmin I would like to see a time stamp in the database save log to confirm when the save completed. I had a customer that could not confirm that the save had completed. There was no message on the console that it had completed, and he saw no time stamp in the log to let him know that it had completed. This would clear up any misunderstanding.
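The requested completion timestamp can be approximated today by wrapping the save in a small script that writes start and finish times to the log. A sketch, with a no-op standing in for the real nhSaveDb invocation (the actual command and its arguments are site-specific and omitted here):

```shell
# Hypothetical wrapper: bracket the database save with timestamps so the
# log shows when (and whether) the save completed. The ":" no-op stands
# in for the real nhSaveDb command.
log=save.log
echo "Save started  at $(date '+%Y-%m-%d %H:%M:%S')" >> "$log"
:  # the real nhSaveDb invocation would run here
status=$?
echo "Save finished at $(date '+%Y-%m-%d %H:%M:%S') (exit $status)" >> "$log"
```

Logging the exit status alongside the finish time also distinguishes "completed" from "completed successfully", which is the same ambiguity the Last Rollup ticket above complains about.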
5/5/2000 9:04:40 PM SysAdmin Excerpt from customer ticket: The customer would like to
- only save certain elements in the database to a backup directory, or
- only load certain elements from a backup directory into the database,
i.e., select which elements to include in the database without having to load them all and then delete the unwanted elements. 9/1/2001 4:19:04 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:42 PM SysAdmin Boeing would like to retain and run reports on as-polled data of RAS information for at least 4 to 6 weeks; after that it would be part of the regular roll-ups. They do not want to retain the rest of the network data roll-up for more than the Concord defaults. 9/1/2001 4:19:05 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:44 PM SysAdmin Bob Michie of Toys R Us at 973-331-2885 or michieb@toysrus.com Ended up having a problem doing a database ASCII save when using a relative path instead of a full path with nhSaveDb. The error he was getting when doing this was:

    Unloading the sample data . . .
    Fatal Internal Error: Ok. (none/) $ INTERNAL: Couldn't open message file '/opt/concord/sys/messageText.sys'

Or at least have the error indicate in some way that possibly a full path was not used or no directory was found. Even though it states this in Appendix A on page A-30 for nhSaveDb, we need this to stand out more (maybe have a bullet in front of it, or bold that part of it). 9/1/2001 4:19:06 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year.
If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:47 PM SysAdmin nhReset changes the permissions on the errlog.log file to read-only, so the system can't write to the errlog.log file any longer. In the nhReset script there is a variable set, NULL_DEVICE=/dev/nul; it is missing an "l" in null. This fix worked on my NT machine. Also, removing the variable altogether and using the path /dev/null in the rm command worked as well. (From the nhStartDb script:)

    NULL_DEVICE=/dev/nul
    # if this is NT, then we need to launch this in a station that is safe from its effects.
    if $NH_SAFE; then
        echo "Stopping Network Health service"
        $NH_HOME/bin/sys/nhiNtScm stop "Network Health"
        echo "Stopping Ingres service"
        $NH_HOME/bin/sys/nhiNtScm stop "CA-OpenIngres_Database"
        sleep 5
    # Truncate the ingres error log.
    ERRLOG=$II_SYSTEM/ingres/files/errlog.log
    if [ -f $ERRLOG ]; then
        tail -500 $ERRLOG >$ERRLOG.trunc 2>&1 &&
        rm $ERRLOG >$NULL_DEVICE 2>&1 &&
        mv $ERRLOG.trunc $ERRLOG >$NULL_DEVICE 2>&1
    fi

Tue Oct 26 15:10:15 EDT 1999 - manthony This is NOT reproducible on NH 4.5.

5/5/2000 9:04:48 PM SysAdmin In the system messages window: 'The database does not contain enough space for more 'conversation' data, dropping this poll.' df -k output shows approximately 1.4 gig remaining for the database. Customer is Jerry Puoplo of John Hancock Mutual Life. Tue Oct 26 15:10:15 EDT 1999 - manthony Customer can set the environment variable NH_POLL_SPACECHECK="yes". This will stop the poller from checking space on the disk. This appears to be a poller issue, NOT a database issue.

5/5/2000 9:04:48 PM SysAdmin Customer has recently installed N/H V4.5 on their machine. They are attempting to load their saved N/H V4.5 Beta D/B into the production release; however, it is failing consistently. This needs to be escalated because of customer sensitivity. This problem has been seen at the Roadrunner group.
9/1/99 Robin Trei and Mike Anthony have been working on this. Assigned to Robin two days ago, but she's not added to the Remedy assigned-field drop-down. Assigning to Mike Anthony on behalf of Robin. 9/29 This should be fixed. Support has an action item to discover if Roadrunner has any other problems. 2-2-00 rlt. I am closing this.

5/5/2000 9:04:50 PM SysAdmin presentation.var and serviceStyles.sde are NOT migrated with db save. 4.5 migration issue. Old machine: NH 4.1.5, Solaris 2.5.1. New machine: NH 4.5, Solaris 2.6. Saved the database from the old machine and tarred and copied it to the new machine. Did a DB load. The new machine has problems running some custom reports, due to the fact that presentation.var and serviceStyles.sde are NOT migrated with db save. After copying the two files manually, everything worked fine. Tue Oct 26 12:56:34 EDT 1999 - manthony Changes to the file serviceStyles.sde are not supported, so on database load changes from a previous revision will be lost. The presentation.var file should be copied on a database load, so for that file this is a problem. (Note to self: to fix in saAppUtils.C checkReplaceFile().) Approximate fix time: 2h. manthony 4/25/00 Added fix to nhiLoadDb that will copy the presentation.vars file.

5/5/2000 9:04:50 PM SysAdmin Database Status incorrectly calculates database size. From the nhDbStatus output I get the following numbers:

    Database Size: 4294967295 bytes (approximately 4.3 GB)
    Free Disk Space: 533250000 bytes (approximately 533 MB)

From the ingprenv file I see that the database is installed on /data/nh/idb. From df -k I get the following numbers:

    Disk Space (total) 8705501 kbytes (approximately 8.7 GB)
    Disk Space (used)  8084812 kbytes (approximately 8.1 GB)
    Disk Space (avail)  533634 kbytes (approximately 533 MB)

Free disk space is being calculated correctly.
I then stuck the output from the ls -alR file into an Excel spreadsheet and added the values for each individual file in the /data/idb/ingres/data/default/nethealth directory. The numbers added up to 8057917440 bytes, which equates to approximately 8.1 GB. It certainly looks as though Network Health is incorrectly calculating the database size. The files I used for calculating the database size are located on \\Voyagerii\Escalated Tickets\25000-25999\25140

10/25 Given the Oracle situation, Mike and I question whether this should be fixed. We think the priority is set too high. Also: #4561, 5367 (owned by wilson), 5523, 5534, 5838 are related. Escalated by mjc.

Fri Dec 10 11:55:05 EST 1999 - manthony Found that we are overflowing a variable when calculating DB size. Any DB that is greater than 4 gig will get reported as being 4 gig in size. I recommend that this be fixed in the next patch for 4.5. Work-around: du -sk $II_SYSTEM/ingres/data/default/ This will be about 4 hours to fix and test. Mon Dec 20 18:23:18 EST 1999 Checked fix into 4.5.1 and it will be available in patch 10. 8/22/2000 9:57:51 AM bhinkel Changed status to Closed, as the fix was included in a product release (4.5.1).

5/5/2000 9:04:50 PM SysAdmin Per Vince Fortin, the following problem is being logged and escalated. The following is the output from the customer's recent Statistics Rollup log:

    Begin processing (09/15/99 16:06:46).
    Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Sep 15 12:17:16 1999) ).

1. The customer is using a remote polling environment (1 remote poller).
2. Statistics rollups have been failing consistently for several weeks now. NOTE: The first failure occurred on July 3; however, the call was first logged in August.
3. The errlog.log shows no errors at all.
4. The "select from iitables..." shows that all tables are indexed.
5.
Customer is running the rollup in debug mode (printqry) and has found problems with each date since July 3.
6. Customer has been instructed to remove the offending dates from the D/B by deleting them from the nh_rlp_boundary table. However, each time, the next successive date (relative to the next table) as provided by the printqry indicates a problem.

This will require assistance from Robin, Mike Anthony, and/or Jay. All relevant files are located in the Escalated Tickets directory #24304 and sparrow:/ftp/incoming/T24304/*

Thu Sep 16 16:44:28 EDT 1999 - manthony Sent a script to support that will identify stats tables with duplicate rows. Also having the customer check that nhiCheckDupStats is installed. Will need to also identify which 0-level stats tables have already been rolled up, so that this problem does not continue to happen. Waiting for results of the script.

Wed Sep 22 09:35:17 EDT 1999 - manthony Results of the script showed NO duplicates in nh_stats0 tables. Requested dumps of the nh_rlp_boundary table and an nh_stats0 table that was causing rollup failures. Customer FTP'd both dumps. The nh_stats0 table dump was corrupted; asked for a re-send of this file. Waiting for this file.

Tue Sep 28 08:57:53 EDT 1999 - manthony Received customer info and loaded it into a 4.1.5 database. Performed a rollup using this data and found that there are 5 elements that seem to be constantly polled twice. This looks like a poller/remote-poller issue and will continue to cause corrupted data. However, we will provide a script that will in effect make it possible for rollups to take place while the poller issue is being looked at.

Tue Sep 28 11:45:23 EDT 1999 - manthony Sent customer a modified nhFetchDb script that has -x turned on. This should verify whether or not the customer has duplicate element ids at more than one site. Awaiting customer info.

Thu Sep 30 16:43:46 EDT 1999 - manthony Customer does not have the nhiCheckDupStats executable.
Asked Customer Support to get them up to rev to get this exe. Also sent the customer a script to get the DB to a point where rollups are possible.

Mon Oct 4 12:47 EDT 1999 - jwolf Gave Don a script to get all customer statistics tables. It seems like we discover an issue and solve it, and a new problem occurs on a subsequent raw day. Customer is reluctant to give us the whole database, but this script will just pull the statistics tables. Also provided Don with a series of data files that I want as well.

Mon Oct 4 5:00 EDT 1999 - jwolf Talked with Don; customer has been having power outages and could not execute the script.

Tues Oct 5 10:45 EDT 1999 - jwolf Talked to Don; Network Health systems are not coming back up because of power outages. They cannot execute the scripts. Don has paged, voice-mailed, and e-mailed Jeff Greene, as he was supposed to be on-site today.

5/5/2000 9:04:53 PM SysAdmin Customer needs to unreference nodes to be able to use Traffic Accountant; he has been unable to use the product for the last month. The first conversations rollup of each calendar day fails with an error:

    Fatal Error: Assertion for '_txnLevel == 0' failed, exiting (~DuDatabase - Unmatched transaction level in file ../DuDatabase.C, line 117). (cu/cuAssert)

This is happening each day just after midnight, during the first conversations rollup of each calendar day. It appears to be connected with the implementation of the NH_UNREF_NODE_LIMIT variable. All other conversations rollups during the day succeed. Having the customer run the following command to gain additional debug output:

    $NH_HOME\bin\sys\nhiDialogRollup -u -d -Dm du:cdb -Dfall -Dt > $NH_HOME\tmp\roll2.txt 3>&1 4>&1

I will place files on \\voyagerii\escalated tickets\23000-23999\23509 when I receive them. HUGE 1.3 MB debug of the conversations rollup now available!!!!

11-17-99 Have started to investigate. RLT. 11-18-99 Fixing this will require a patch and a script to be run at the customer site.
However, once the script is run, the customer should not see any more errors. I have requested the customer's database for final verification, and will then be able to give them a script. I have the database and have examined it. Unfortunately, in the course of testing the proposed fix, I encountered another problem/bug which is blocking my test. I am currently investigating that issue. In the interim, I propose that we send the script to adjust the duplicates so that they can proceed.

5/5/2000 9:04:53 PM SysAdmin Frequently they will swap an entire router and bring one back to the shop for repairs. When he rediscovers the new router, he wants to keep the history from the previous one on the new one. 9/1/2001 4:19:09 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:53 PM SysAdmin Excerpt from customer ticket: When the load starts, it should check/calculate the available disk space in the database area. If the available space is too small, the command should display a warning and abort. NetHealth ought to be able to do an estimate and print a warning if it thinks the available space could be too small. 10/25 Given the Oracle situation, Mike and I question whether time should be spent on this. Getting the information is problematic, and extremely Ingres-specific. Also: #4561, 5367 (owned by wilson), 5523, 5534, 5838 are related. 9/1/2001 4:19:09 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:55 PM SysAdmin John Warner called in and said he would like the load.log populated with the output of the database load while the load is ongoing, so that he can monitor the progress.
Currently the load.log only gets written to when the load is finished. 9/1/2001 4:19:10 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:04:59 PM SysAdmin Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. During a DB load I get the following errors:

    Loading the sample data . . .
    Creating the Table Structures and Indices . . .
    Creating the Table Structures and Indices for sample tables . . .
    Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Nov 02 16:21:36 1999) ).
    Load of database 'nethealth' for user 'nethealth' was unsuccessful.
    Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Nov 02 16:21:36 1999) ).
    Error: The program nhiLoadDb failed.

*******************************************************************************
Metlife has 2 machines with NH on them.
- The "old" machine: Dell, running NT 4.0. NH 4.5 P01, D02. DB brought over from a retired HP-UX machine. Up and polling.
- The "new" machine: Compaq running NT 4.0. 4.5 P04, D03. Not polling; this is the box with the database problem.
- The database was moved from the old machine to the new machine. The move was from a 4.5 install to a 4.5 install. Getting dups on load; no errors in data analysis on this database.
*******************************************************************************
- Ran findDupTable; found nothing.
- Ran fixDups and fixDups2; both require table names and he has none. They fail.
- Ran removeElementDups2; seems to work.
- Ran nhSaveDb and nhLoadDb. Worked on the old install.
- Has all data on the old install; missing historical data on the new install.
- He had the nhServer off during the first save on the new install. Maybe a bug.
- With the nhServer off the save was 7 MB; with it on it was 117 MB.
- The nhLoadDb fails again. The production install is up and running.
*******************************************************************************
Mon Nov 8 14:48:40 EST 1999 - manthony Requested support get the database from the customer.

Tue Nov 9 14:16:36 EST 1999 - manthony Customer is FedExing us the DB; for some reason he cannot FTP it. At the same time he is running a special LOADER that will tell us exactly what tables are having the problem. Customer is also running the finddups script. Awaiting results.

Tue Nov 16 15:30:13 EST 1999 Sent the customer a script to clean the DB. Awaiting feedback. 8/11/2000 9:51:26 AM manthony This should be marked as closed, not fixed.

5/5/2000 9:04:59 PM SysAdmin The customer is on Solaris 2.6, and I have tested this on my NT machine also. The customer, BB-Data, is seeing the attached output from the nhDbStatus program and would like an explanation of the statistics. As you can see, the complete database size is given as 4.29 GB; however, if you add all the values for "Statistics Data," "Conversations Data," "Rolled up Conversations," and "Rolled up Top Conversations," it comes to 1003518704 bytes. The customer is worried that something may be wrong with the database.
    Database Name: nethealth
    Database Size: 4294967295 bytes
    RDBMS Version: OI 2.0/9712 (su4.us5/00)

    Location Name         Free Space             Path
    +-------------------+----------------------+---------------------------------+
    | ii_database       | 6743838000.00 bytes  | /DATA/nethealth/idb             |
    +-------------------+----------------------+---------------------------------+

    Statistics Data:
        Number of Elements: 9669
        Database Size: bytes
        Location(s): ii_database
        Latest Entry: 27/10/1999 09:32:50
        Earliest Entry: 18/10/1998 00:00:00
        Last Roll up: 27/10/1999 00:44:46
    Conversations Data:
        Number of Probes: 32
        Number of Nodes: 36368
        As Polled:
            Database Size: 240216000 bytes
            Location(s): ii_database
            Latest Entry: 27/10/1999 09:15:00
            Earliest Entry: 17/10/1999 00:10:44
            Last Roll up: 27/10/1999 00:51:53
        Rolled up Conversations:
            Database Size: 393746000 bytes
            Location(s): ii_database
            Latest Entry: 16/10/1999 23:44:50
            Earliest Entry: 18/10/1998 00:21:35
        Rolled up Top Conversations:
            Database Size: 223962000 bytes
            Location(s): ii_database
            Latest Entry: 16/10/1999 23:44:50
            Earliest Entry: 18/10/1998 00:21:32

5/5/2000 9:05:00 PM SysAdmin Customer has a database that is named nethealth1 and cannot schedule a database save via the GUI, because the scheduler does not have the ability to allow any name other than nethealth. The environment variable NH_RDBMS_NAME is correctly set to nethealth1, and when they run nhDbStatus it shows nethealth1, but there is no place to change the name in the scheduler. And when they try to run a scheduled database save they get the error message below. Here is the output from the scheduled database save log file:

    ----- Job started by Scheduler at '09/11/1999 05:01:42'. -----
    ----- $NH_HOME/bin/sys/nhiSaveDb -u $NH_USER -d $NH_RDBMS_NAME -p /export/home/nethealth/db/save/daily nethealth -----
    Error: Unknown or invalid argument 'nethealth'.
    Error: Unknown or invalid argument 'nethealth'.
    ----- Scheduled Job ended at '09/11/1999 05:01:47'.
Database saves can be successfully run from the command line, but the customer cannot schedule them. Not a bug!!!!!!!

5/5/2000 9:05:00 PM SysAdmin Error: Unable to execute 'CREATE TABLE nh_stats1_940132799 AS SELECT * FROM NH_RLP_STATS' (E_US07DA Duplicate object name 'nh_stats1_940132799'. Requested per Michael Anthony:
- The Ingres errlog.log file
- The database
- The sqlout file
- results.txt
- The output of the following command: hendrix% sql nethealth *help \g

The database is being saved and will be FTP'ed today. Files are in the escalated tickets dir on voyagerii.

Mon Nov 15 13:03:40 EST 1999 - manthony Requested the latest errlog.log from the customer.

Tue Nov 16 15:30:13 EST 1999 - manthony Customer changed time zones several times since Oct. 20th. This caused a table to be indexed that should not be, which in turn is causing Ingres to run out of locks on bulk loads (inserts). At this point the customer is willing to drop data prior to Nov 1st. Sent the customer a script to drop one stats1 table and he is now running rollups. Awaiting feedback.

Don 11/23
> -----Original Message-----
> From: Andrew Gerber [mailto:gerber@qwest.net]
> Sent: Monday, November 22, 1999 10:13 PM
> To: Gray, Don
> Subject: RE: Ticket #27949
>
> Yes, we are all fine. Thanks. Close the ticket.
>
> Andy
>
> -----Original Message-----
> From: Gray, Don [mailto:DGray@concord.com]
> Sent: Monday, November 22, 1999 4:31 PM
> To: 'Andrew.Gerber@qwest.net'
> Subject: Ticket #27949
>
> Andrew,
> Did the rollups for 4.1.5 run this weekend? Can we close the call?
>
> Don
==========================================================

5/5/2000 9:05:00 PM SysAdmin Begin processing (10/25/1999 08:00:13 PM). Error: Append to table nh_stats2_936511199 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 144 rows not copied because duplicate key detected. The database is being FTP'ed to us at this time per request of Michael Anthony.
The database is up on the FTP server: 27378_4_5.zip.
Tue Nov 16 15:30:13 EST 1999 - manthony Analyzed database and found that there are duplicates in three stats tables. The dups have 3 distinct characteristics: (1) alternate latency, (2) availability backfill, (3) full row dups. Developed a script to clean all of these problems up. Ran it on the DB we have loaded here. Now running rollups. If all goes well, will send customer scripts.
Tue Nov 16 17:41:18 EST 1999 - manthony Sent customer clean-up scripts. If all goes well they can run rollups. Awaiting feedback.

5/5/2000 9:05:00 PM SysAdmin There is a hole in the way rollups work that allows duplicate data to be inserted into the statistics tables. The transactions in rollups work as follows:
(1) Create table; commit.
(2) Bulk load the table; commit.
(3) Index the table; commit.
(4) Update the rollup boundary table; commit.
(5) Drop the old stats table; commit.
If rollups fail for ANY reason (system crash, bug in the code, transaction log full, etc.) after step (2) is complete but before step (3) is complete, rollups will create duplicates in the stats table the next time they are run. Or: if a save is run while rollups are running, a stats table is between steps 2 and 3, and the database is later loaded, rollups will create duplicates in the stats table the next time they run. This is a problem which should be fixed ASAP, but the fix is HIGH RISK.
Tue Dec 21 18:50:40 EST 1999 - manthony Combined the append and indexing of rollup tables into one transaction so that duplicates will be less likely to be inserted into the tables and so that we decrease the probability that we attempt to insert duplicates.

5/5/2000 9:05:02 PM SysAdmin In the environment under NH4.5 (Japanese) P04/D04 on Windows NT, Database Save and Database Load cannot be handled from the menu:
1. Database > Save Database: if we use Japanese characters in the directory name, the dialog shows garbage in the directory hierarchy window.
2.
Database > Load Database: same phenomenon as above.
3. Database > Load Database: also, in some cases it cannot load the database, but the Console shows the Database Load Complete message.
When we use "$B$F$9$H!I(J" as the name of the directory, the problem does not occur. When we use "NetworkHealth$B%f!<%6%G!<(J$B%?(J" as the directory name, the problem occurs. I have attached 4 files in a zip. Consolemessage.txt is a copy of the console's system message window; it shows "Database load complete" without error in a second. Load.log.bak contains the error message in the case of using Japanese characters in the directory name. Load.log is the successful message from loading the same database executed from the command line. command.txt shows the command line when I succeeded in loading the database. Screen shots and logs are located on Voyager in the escalated tickets directory.

5/5/2000 9:05:03 PM SysAdmin Rick Glasheen is using this utility at a customer site, and during the import portion of the program he gets the following error. Excerpt from import log:
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Dec 2 16:35:00 1999) ).
Don asked that I bug this as this should behave the same as the recently revised nhLoadDb command. The nh_stats1_XXXXXXXX table and output from the help \g | sql nethealth command (file.txt) are in the escalated ticket directory: \\VoyagerII\Escalated Tickets\28000-28999\28747
Dec 8 1999 Looks like the stats data file is corrupt in the above directory. Asked for a resend.
01/04/2000 We should do whatever the main product does when it encounters duplicates, which I think is to create a non-unique index on the table. This will require looking at all the insert code in the import utility to find all places this can occur - estimate 2 days.
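The proposed fallback can be sketched in the document's own echo-to-sql idiom. The table and key names below are examples borrowed from a MODIFY statement quoted elsewhere in this log, not what the import utility actually generates:

```shell
# Sketch of the "non-unique index fallback" (table/key names are examples):
# try the unique MODIFY first; if duplicate keys make it fail with
# E_US1591/E_US1592, build a plain (non-unique) BTREE instead, as the
# main product is believed to do.
TABLE=nh_dlg1s_938469599
KEYS="sample_time, dlg_src_id, nap_id, proto_id"
UNIQUE_SQL="MODIFY $TABLE TO BTREE UNIQUE ON $KEYS WITH FILLFACTOR = 100"
FALLBACK_SQL="MODIFY $TABLE TO BTREE ON $KEYS WITH FILLFACTOR = 100"
# In practice each would be piped to Ingres, e.g.:
#   echo "$UNIQUE_SQL;\g" | sql $NH_RDBMS_NAME
# and the fallback run only if the unique MODIFY reports duplicate keys.
echo "$UNIQUE_SQL"
echo "$FALLBACK_SQL"
```

The point is that the import continues past tables with duplicates instead of stopping, matching the "set it aside and go on" behavior requested below.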
Estimate: 2d
4/20/00 RobL - since we have an estimate, I'm moving this to evaluated. I would like Ekta to fix this, perhaps as part of the work she is doing for performance.
rlt/mp 4-20 This work should be done for any table, not just stats - dlg, bsln, any table that we load that has duplicates: we should check for this error condition, set it aside, and go on without stopping.
5/10/2000 4:12:26 PM rtrei I reset this problem's assigned priority from critical to low: this is a split/merge fix which will not go out with the Beta or 4.7 release anyway. I just found out the bug fix was done in 4.5 and must be merged up to 4.6 and 4.7. No problem there, but I believe it still has to go through the tribunal process, so it will not be able to be checked in until mid to late next week. Since this really isn't part of the 4.7 release, it made sense to adjust the assigned priority and let the process take its normal course.
5/11/2000 9:12:27 PM rlindberg By triage committee these tickets are moved to postponed.
5/16/2000 2:01:48 PM vhf This problem still exists. It is not a showstopper for 4.7 AND we should not merge the fix into 4.7 mainline now, but the ticket should remain open until the fix is merged to 4.7 and confirmed fixed.
5/18/2000 6:33:26 PM smcafee Moved low priority issues to postponed. Needs review for readme.
6/27/2000 11:29:38 AM rtrei At this point, I believe code has been merged to 4.6 but not to 4.7; this should happen when the remainder of the 4.7 split/merge work is done, aimed towards early August. (Remember, split/merge is NOT part of the 4.7 formal release.)
1/22/2001 12:36:40 PM rtrei Actually, Phil Adams is starting work on 4.8 split/merge now. I recommend he investigate this and determine if it is done, or can be moved into the 4.8 or 5.0 schedule. If it is simply a matter of a merge, then it is a half day task. Otherwise, I estimate 3-4 days.
1/22/2001 12:37:04 PM rtrei Forgot to set this back to evaluated.
5/3/2001 3:07:18 PM lemmon Reassigned to yzhang because Phil Adams is no longer with Concord.
6/18/2001 12:49:07 PM lemmon Recommend that this be declined.

5/5/2000 9:05:03 PM SysAdmin Customer hit the 64-bit counter bug. Before the problem could be identified and corrected, bad values (very high) were inserted into the db. Customer has 6 months of historical data and does not want to skew current reports with the unreasonable numbers. Need a series of SQL statements to remove the errant values for a single element for a few days' time. Enterprise Network Systems (formerly TLA), Pete Silvestre.
Wed Dec 8 13:38:06 EST 1999 - manthony Sent the customer a script to delete requested elements. Customer sent element names. Awaiting feedback.
Fri Dec 10 10:01:11 EST 1999 - manthony Customer stated that the script fixed the problem. Changing status to fixed.
8/11/2000 9:54:19 AM manthony This should be closed, not fixed.

5/5/2000 9:05:04 PM SysAdmin Rick Glasheen has spoken to Mike P about this already. He would like the migration utility to handle specific requests to recover stats tables.
9/1/2001 4:19:11 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:04 PM SysAdmin During a DB load at Bear Stearns, Jon Warner encountered the error message:
Loading the sample data . . .
Updating a prior version 4.1 database . . .
Begin processing Copying IP Address into the Database (12/08/1999 09:24:55 PM).
End processing Copying IP Address into the Database (12/08/1999 09:30:16 PM).
Creating the Table Structures and Indices . . .
Non-Fatal database error on object: NH_HOURLY_HEALTH 08-Dec-1999 21:45:06 - Database error: -39100, E_QE0083 Error modifying a table.
(Wed Dec 8 21:45:05 1999) OS=Solaris 2.6. The database is on the FTP server in /ftp/incoming/29012.tar.
==========================================================
Mon Dec 13 17:56:08 EST 1999 - manthony Need to know if the customer wants the old DB cleaned or the NEW 4.5 DB cleaned. Customer Support getting this info.
Mon Dec 13 19:16:50 EST 1999 - manthony Actually have loaded this table just fine. I believe that the user ran out of disk space at the time of the load. The user was trying to load the db on a pretty full disk - 648 MB available. Probably not enough space to run the modify on this table, which is about 550 MB. Changing to fixed.
-EEG 12/28/99 If this bug was generated because the customer ran out of disk space then it is really a NoBug, and should not be posted as 'Fixed'. Changed to NoBug.

5/5/2000 9:05:06 PM SysAdmin The following is a list of feature requests made by POC installations as well as current users when visiting their sites after 4.5 was installed. 7. Scheduling several rollup periods based on configured Service Profiles. This would allow some reports to have longer As Polled data and others to have shorter As Polled data.
9/1/2001 4:19:11 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:06 PM SysAdmin Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
- Customer receives this error in the nhStatsRollup.log file.
- The output of help\g and the new findDups script are on voyagerii in the escalated tickets directory.
- For Andrew Gerber at Qwest.
----------------------------------------------------------------------
From the nhStatsRollup.log:
----- Job started by Scheduler at '12/17/1999 08:00:21 PM'. -----
----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (12/17/1999 08:00:22 PM).
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Fri Dec 17 22:08:52 1999) ).
----- Scheduled Job ended at '12/17/1999 08:08:53 PM'. -----
----------------------------------------------------------------------
From the errlog.log file:
E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
----------------------------------------------------------------------
Tue Dec 28 09:32:21 EST 1999 - manthony Sent customer a script to clean DB. Awaiting feedback.
----------------------------------------------------------------------
Customer said to close ticket; this is all set now.

5/5/2000 9:05:07 PM SysAdmin Conversations rollup failure.
Error: Error: Unable to execute 'MODIFY nh_dlg1s_938469599 TO BTREE UNIQUE ON sample_time, dlg_src_id, nap_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_US1591 MODIFY: table could not be modified because rows contain duplicate keys.(Tue Dec 28 15:00:36 1999)).
No table nh_dlg1s_938469599 exists in "help\g" output. Have attempted to remove references to this table from nh_rlp_boundary. No change in rollup error. No conversations rollups in a long time (60 days). We are in danger of filling up the partition.
Tue Jan 4 10:05:21 EST 2000 - manthony Yesterday requested either the database or access to the system. Waiting for CS.

5/5/2000 9:05:07 PM SysAdmin Customer wants system log save scheduled automatically with installation of Network Health. Customer wants to be able to schedule system log save as well.
9/1/2001 4:19:12 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:08 PM SysAdmin Error during upgrade or load of database: Cannot convert column 'element_class' to tuple format.
- Solaris 2.6.
- Customer attempted to upgrade to 4.5, received this error.
- Suspected a lack of disk space was the problem.
- Customer added 9 GB of space to the partition.
- Finally installed 4.1.5 and successfully loaded the database save to this install.
- The database can be loaded to 4.1.5 but cannot be converted for 4.5.
- Requested the database. Database is on the FTP server and is named: 29610.tar.
========================================================================
rlt 1-4-00 The problem was that the customer had an illegal character in one of their element names. We suspect that it came in from Discover - that their sysname has a '|' character in it.
What they need to do is change this name and remove the pipe. Running the following at the command prompt (from their NH_HOME directory, where they have sourced nethealthrc.csh) should take care of the problem:
echo "update nh_element set name = 'peer1.CGX1-RH-A6/0/0-:Qwest:_xxx__Ameritech_xxx:_To:_Ame' where element_id = 1002961;commit\g"|sql nethealth
They will also need to update the name in their poller.cfg file. Go from:
peer1.CGX1-RH-A6/0/0-:Qwest:_xxx_|_Ameritech_xxx:_To:_Ame
To:
peer1.CGX1-RH-A6/0/0-:Qwest:_xxx__Ameritech_xxx:_To:_Ame
Let me know if you need directions on how to do this. The customer can change the name to something else, but the name in the poller.cfg file and in the database must be the same.
**********************************************************************
don 1/13/2000 Customer has worked around the illegal characters. He was using a custom naming script.

5/5/2000 9:05:09 PM SysAdmin The nhServer bounces periodically. In the system log appears:
Server started successfully.
Console initialization complete.
Poller initialization complete (Conversations Poller).
Poller initialization complete (Statistics Poller).
Poller initialization complete (Import Poller).
In the errlog.log of Ingres appears this error:
TRUHEALT::[II\INGRES\1f4 , 0000024f]: Sun Jan 09 23:31:20 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association
But the error is not always present in Ingres when the server restarts. There's no message about when the nhServer stops. This is a brief description of what I could see from the customer's errlog.log and system.log. I have remote access to the system.
errlog.log TRUHEALT::[II\INGRES\1f4 , 0000024f]: Sun Jan 09 23:31:20 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association
system.log Sunday, January 09, 2000 23:32:00 System Event (nhiCfgServer) Server started successfully.
Sunday, January 09, 2000 23:32:13 System Event (nhiConsole) Console initialization complete.
Sunday, January 09, 2000 23:32:31 Poller initialization complete (Conversations Poller).
Sunday, January 09, 2000 23:33:03 Poller initialization complete (Statistics Poller).
Sunday, January 09, 2000 23:37:03 Poller initialization complete (Import Poller).
errlog.log TRUHEALT::[II\INGRES\1f4 , 00000157]: Mon Jan 10 01:10:42 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association
errlog.log TRUHEALT::[II\INGRES\1f4 , 0000026a]: Mon Jan 10 04:05:03 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association
system.log Monday, January 10, 2000 04:05:58 System Event (nhiCfgServer) Server started successfully.
Monday, January 10, 2000 04:06:08 System Event (nhiConsole) Console initialization complete.
Monday, January 10, 2000 04:06:31 Poller initialization complete (Conversations Poller).
Monday, January 10, 2000 04:07:02 Poller initialization complete (Statistics Poller).
Monday, January 10, 2000 04:11:01 Poller initialization complete (Import Poller).
errlog.log: Monday, January 10, 2000 06:00:59 Internal Error (nhiMsgServer) Database error: (E_US1194 Duplicate key on INSERT detected.
errlog.log TRUHEALT::[II\INGRES\1f4 , 000001bb]: Mon Jan 10 08:51:56 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association
system.log Monday, January 10, 2000 08:52:16 System Event (nhiCfgServer) Server started successfully.
Monday, January 10, 2000 08:52:27 System Event (nhiConsole) Console initialization complete.
Monday, January 10, 2000 08:52:47 Poller initialization complete (Conversations Poller).
Monday, January 10, 2000 08:53:18 Poller initialization complete (Statistics Poller).
Monday, January 10, 2000 08:55:13 Poller initialization complete (Import Poller).
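The restart-versus-association-failure pattern above can be tallied mechanically. A small sketch (both log paths are assumptions for this installation) that counts restart events against the Ingres association failures, to see whether they line up:

```shell
# Count nhServer restarts in system.log and E_GC0001 association failures
# in the Ingres errlog.log; matching counts/timestamps would support the
# reading below that the failures are a symptom of an exe crashing, not
# a database problem.
SYSLOG=${SYSLOG:-/opt/nethealth/log/system.log}   # assumed location
ERRLOG=${ERRLOG:-/opt/ingres/files/errlog.log}    # assumed location
RESTARTS=$(grep -c "Server started successfully" "$SYSLOG" 2>/dev/null || true)
ASSOC_FAILS=$(grep -c "E_GC0001_ASSOC_FAIL" "$ERRLOG" 2>/dev/null || true)
SUMMARY="restarts=${RESTARTS:-0} assoc_failures=${ASSOC_FAILS:-0}"
echo "$SUMMARY"
```

On the excerpts above, each restart cluster is preceded by one E_GC0001 entry, which is consistent with the crash-then-restart interpretation.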
The system also presented problems due to a lack of space in the Ingres transaction log, but we're not getting these errors anymore; the file was increased to 600 MB.
***************************************************************************************
Jan 12: RLT This installation seems to have everything broken on it. It has problems with nhiCfgServer, nhReports, and nhiPoller. First let me speak to the database part, as that is my area. The message:
"TRUHEALT::[II\INGRES\1f4 , 000001bb]: Mon Jan 10 08:51:56 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association"
just means that an application crashed; it does not imply any database problems at all. It isn't anything to worry about. That is, it is a symptom that an exe is crashing, but it isn't the root cause. There were 2 kinds of duplicate messages in the errlog also. I don't think either of them is the root cause, either. I think they are just the result of so much being messed up that things can't run properly. Of course I will continue to investigate, but for now they are a lower priority. Here is what I think is wrong: first, I think they did not follow the instructions in the README file for converting their custom report files, so that is why nhReport is failing. And I think the big problem is due to a problem with the poller. This message is throughout the system log:
Monday, January 03, 2000 17:46:26 Internal Error (nhiPoller[Net]) Expectation for '_requestCount < _ResponsesHandled' failed (Element pcfwha_fr-dlci-3-334 in file ../UplRespMgr.C, line 1031). (cu/)
Monday, January 03, 2000 17:46:34 Internal Error (nhiPoller[Net]) Expectation for '_requestCount < _ResponsesHandled' failed (Element pcfwha_fr-seg-1 in file ../UplRespMgr.C, line 1031). (cu/)
Monday, January 03, 2000 17:47:17 A scheduled poll was missed, the next poll will occur now (Statistics Poller).
Monday, January 03, 2000 17:50:22 A scheduled poll was missed, the next poll will occur now (Statistics Poller).
Monday, January 03, 2000 17:53:26
It looks like nhiCfgServer is also crashing, but whether that is the root or the symptom, I'm not sure. Here is what has been done so far: last night I asked Jose to review the README file and get the custom reports converted per its directions. (This is based on information from Dave Rich.) This should solve the report problem and at least get us to the status of a cleanly finished install. I also asked Jose to run nhiIndexDiag against the database to locate and pull the duplicate information. As mentioned, I don't think this is the root cause, but it needed to be done, and wouldn't hurt. I will be working with Dave Shepard on the nhiPoller issue and will provide an update once we see something. I've asked Chris Ramos for his opinion on the nhiCfgServer messages. We may have to turn diagnostics on for all the pollers and servers for this system, but Support did a good job getting a lot of information, and I am working through that first.
01/13/2000 Customer has confirmed with Gordon Booman that when he removed all patches from the system the reports are working again - this is on the "new" NT system, 4.5 with no patches. We are using advanced logging to get server logs when the system restarts so that we can determine which server has the problem.
01/14/2000 This is brief, will put in more detail on Monday, but did want to record that work is proceeding. We will be running with advanced logging for the msgServer and the DBserver over the weekend.
01/14/2000 Found this in system log:
Thursday, January 13, 2000 17:51:22 System Event (nhiConsole) Console initialization complete.
Thursday, January 13, 2000 17:51:24 Internal Error (nhiMsgServer) Unexpected Null value for '101' (possibly out of memory).
(ms/)
Thursday, January 13, 2000 17:51:24 Internal Error (nhiConsole) Assertion for 'obj->isA (name2(__,CdmGenItem))' failed, exiting (in file ../CdmGenItem.C, line 57).
Thursday, January 13, 2000 17:51:25 Internal Error (nhiMsgServer) Unexpected Null value for '101' (possibly out of memory). (ms/)
Thursday, January 13, 2000 17:51:58 System Event (nhiConsole) Console initialization complete.
Console crashed coming up, which also
8/18/2000 11:56:34 AM bhinkel Changed status to Closed, as the fix has been included in release 4.7.

5/5/2000 9:05:10 PM SysAdmin (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
- This error keeps appearing, even after running scripts to remove dups.
- Also ran nhiIndexDiag and placed output on voyagerii under the ticket number, 29271.
- HP-UX 10.20.
- Nethealth 4.5, P10.
Escalated due to customer sensitivity. This is Royal Bank in Canada and they have had multiple problems with NH. mjc

5/5/2000 9:05:11 PM SysAdmin (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
- This problem was corrected in call ticket 29271/6717.
- The Stats_index had run clean for Jan. 18 and Jan. 19; however, today's check of the log revealed the following results again.
- HP-UX 10.20.
- Nethealth 4.5 P10.
----------------------------------------------------------------------
----- Job started by Scheduler at '20/01/2000 08:20:37'. -----
----- $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (20/01/2000 08:20:38).
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Jan 20 08:21:14 2000) ).
----- Scheduled Job ended at '20/01/2000 08:21:15'.
-----
----------------------------------------------------------------------

5/5/2000 9:05:11 PM SysAdmin Cannot access table information due to a non-recoverable DMT_SHOW error.
- They do a back-up every morning at 6:30.
- He gets an email that tells him if it is successful.
- He got a message that it failed.
- Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats2_928652399 () INTO '/disk1/concord/db/save.tdb/nh_stats2_928652399'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error.
- Solaris 2.6.
- NH4.5.1.
- Log files have been requested.
- They have an archive db.
==========================================================

5/5/2000 9:05:11 PM SysAdmin Customer's statistics rollups are failing, the statistics indexing is failing, and he gets "E_US1194 Duplicate key on INSERT detected." in the system log. NH 4.5.1 / Patch 10 / HP-UX 10.20. //voyagerii/Escalated Tickets/30000/30335
From system log:
Thursday, January 20, 2000 04:47:22 PM Error (nhiMsgServer) Database error: (E_US1194 Duplicate key on INSERT detected. (Thu Jan 20 09:47:22 2000)
Thursday, January 20, 2000 08:04:45 PM Job step 'Statistics Rollup' failed (the error output was written to /opt/nethealth/log/Statistics_Rollup.100000.log Job id: 100000).
----------------------------------------------------------------------
From Statistics_Rollup.100000.log:
----- Job started by Scheduler at '20/01/2000 20:00:54'. -----
----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (20/01/2000 20:00:55).
Error: Append to table nh_stats1_945467999 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 153 rows not copied because duplicate key detected. ).
----- Scheduled Job ended at '20/01/2000 20:04:46'.
-----
----------------------------------------------------------------------
From the Statistics_Index.100005.log:
----- Job started by Scheduler at '21/01/2000 10:20:54'. -----
----- $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (21/01/2000 10:20:55).
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Fri Jan 21 03:21:09 2000) ).
----- Scheduled Job ended at '21/01/2000 10:21:18'.
----------------------------------------------------------------------
Output of nhiIndexDiag command (files at sparrow: /ftp/incoming):
30335_nhiIndex.log
30335_session.nh_stats0_948005999.dat
System log file, errlog.log, Statistic_Rollup and Statistic_Index are in: //voyagerii/Escalated Tickets/30000/30335
Customer is SNS/Telkom SA.
2-7-00 from RLT From the information on voyagerii, I think this customer has 2 problems. One is the standard dups in the stats tables. I'm a bit confused about whether the dups are in a stats0 or a stats1 table; from the logs, I think it is the latter. Mike A is investigating what is going on with stats1 dups. So far we know it has to do with availability backfill and the poller bouncing, but we still need to determine where in the code the problem occurs. Please try a nhiIndexDiag -u nhuser -d nethealth again; I don't know why it had problems. It would be really helpful to get those duplicates. Are we certain the customer is at patch 10? The 2nd problem is the duplicate-on-insert problem. Please see problem 6824 for more information regarding that. Basically, this is a 'benign', but very annoying error. I would like to look at these tables, though. I will forward Don the commands I would like run.
2-11-00 RLT Got list of tables from Sheldon. Customer had duplicates in 2 stats0 tables.
They were a variant of availability backfill. Sent script to clean up to Sheldon. Asked for database and some log history so that we can see if this is one of the solved problems or if it is a new one.

5/5/2000 9:05:12 PM SysAdmin User has Network Health 4.5.1. Statistics Rollup and Exception report are failing due to errors in the DB; below are the errors in the log directory:
----- Job started by Scheduler at '01/25/2000 20:00:31'. -----
----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (01/25/2000 20:00:32).
Error: Unable to execute 'CREATE TABLE nh_stats1_946616399 AS SELECT * FROM NH_RLP_STATS' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Tue Jan 25 20:00:38 2000)
----- Scheduled Job ended at '01/25/2000 20:00:38'. -----
----- Job started by Scheduler at '01/26/2000 06:00:32'. -----
----- $NH_HOME/bin/nhReport -scheduled -rptType health -rptName All-Exception-Rpt -subjType group -elemType lanWan -subjName All-LAN-WAN-Exceptions-Grp -uiNamesType names -autoRange yesterday -protocols all -namesType names -web $(SUBJECT)_$(DATE)_$(TIME) -webUser nethealth -jobId 1000005 -jobCount 23 -----
Fatal Error: Assertion for '_daCtrlsFile' failed, exiting (RuDaCtrlsResult::getDaCtrls must call readFile first in file ../RuDaCtrlsFile.C, line 641). Report failed.
I showed these errors to Robin Trei; she asked to escalate immediately. A DB save and a copy of the whole log directory are on the ticket directory: 30416.
Jan 31. RLT: I ran a verifydb against the database, and shipped information up to CA to investigate. The verifydb showed that somehow an entire column had gotten dropped (not something that we do in our code). There is no recovery from that, so I told Jose that the database would have to be recreated and reloaded. The Engineering part of this problem is done. As per my conversation with Chris R, I am closing this problem ticket.

5/5/2000 9:05:12 PM SysAdmin What the customer wants is a flag or option.
Example: nhLoadDb -scheduleroff. OR, when the load completes it asks: "Do you want to run scheduled jobs now? Y/N". Or you could allow a 1-hour grace period so that the customer can turn off the scheduled jobs if they so wish. Something would be better than nothing.
9/1/2001 4:19:14 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:16 PM SysAdmin Conversations rollup failure. No error messages in any log except the system messages log. Customer has been experiencing error messages in the console stating that the conversations rollup has been failing. The Conversations_Rollup.xxxxxxx.log and the Ingres errlog.log contain no information to debug the problem. Selecting the DB Status from the UI indicates that the last rollup failed. Files are available on voyagerii\tickets\31000\31358. Customer is AT&T Wireless.

5/5/2000 9:05:17 PM SysAdmin After upgrade to NH 4.5, getting errors: Append to table nh_dlg0_950669999 failed.
- NH 4.5, P11 & D08.
- Solaris 2.6.
- Upgrade had failed initially; removed NH, installed 4.5, and loaded a saved 4.1.5 database.
- Began polling with the loaded DB; started getting many "Append to table" errors during polling.
- From the sysmessages log: Tuesday, February 15, 2000 07:10:00 PM Error (nhiPoller[Dlg]) Append to table nh_dlg0_950669999 failed, see the Ingres error log file for more information.
- No reference to this message in the Ingres errlog.log file.
I have the database on hendrix in the /export/hendirx2/call_tickets/31298_honeywell_nathan directory. The db archive has been unencrypted and is still in a tar file named 31298_feb.tar.
======================================================
rlt: I am looking into this. Originally, I thought it was a time zone problem; now I don't think so.
(I must have made some errors when I first looked at the data, because my second look is turning up different dates.) We will need to get the dlg files from the $NH_HOME/tmp directory - I have asked Brad Carey to look at the dlg 0 poller problems. See change history.
-----
03/24/00 Arlene Instrumented the poller: added a debug message to the poller and created a new nhiPoller. Customer needs to save the current nhiPoller in $NH_HOME/bin/sys, shut down Network Health, and put this nhiPoller into place. He needs to go into his startup.cfg in $NH_HOME/sys and shut off the dialog poller. Next, bring Network Health back up; the conversations poller will not come up. As the Network Health user, go to $NH_HOME/bin/sys and type the following command:
./nhiPoller -dlg -Dm poller -Df Oo -Dt >& /tmp/poller.trc
This will be writing 2-3 lines for each probe for each polling interval. Make sure /tmp has enough space to hold it. I need the append error to occur while this is running. It would be best if I got at least 2 hours before and after the append error occurred. Once this test is complete, send the poller.trc file, the contents of $NH_HOME/tmp, and the system log. Then: stop Network Health, move your original poller back into place, change startup.cfg in $NH_HOME/sys back to its original state, and restart Network Health. The poller is on the ftp site: nhiPoller.7002
10/23/2000 4:29:47 PM akearns No feedback from customer - bug is not reproducible anymore.

5/5/2000 9:05:18 PM SysAdmin Entire error: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Sun Feb 27 20:01:43 2000)) No rows referenced in rollup log or error logs.
Ingres errlog.log error: 000001e6 Mon Jan 17 20:02:48 2000 E_CL103B_LK_RELEASE_BAD_PARAM LKrelease() failed due to a bad parameter, either the system could not find the lock (lkb = 0) or the lock found had an incorrect attribute (lkb_attribute = FFFFFFFF).
errlog.log, iivdb.log, system event log, $NH_HOME/log/* found on voyagerII. 4.5.1 P11 D07. Customer is Karl Oder at Rush St. Luke Medical Center.
3/1/00 12PM - Jay Sent Tara a script to get debug information from St. Luke's. Estimate: 10h
3/2/00 12PM - Jay System was NT, required script change
3/3/00 10AM - Jay Sent Tara update request
3/3/00 5PM - Jay Received data from St. Luke's. Followed up with a script that will get the day's worth of data in question.
> Hello, I ftp'd the file to ftp.concord.com since it is too big to email. I placed it in the incoming directory with the name RPSLMC_concordDbg2.tar. Regards, Karl Oder
3/5/00 10:40AM - Jay Received tar file from St. Luke's. Investigating...
3/7/00 11AM - Jay Provided Stephanie a script to clean up the database. This problem came about as a result of the issues being dealt with in 6966.
3/8/00 11AM - Jay Sent email to Stephanie requesting status.
3/14/00 - Jay Closing this issue per Stephanie. Customer is cleaned up with the script. I am pushing for the ultimate fix (ProbT6966) to go into patch 13.

5/5/2000 9:05:18 PM SysAdmin Customer: IRS. Network Health 4.1.5 P11. OS: Solaris 2.6. User became aware that the conversations rollup was failing because the Database status from the console says: 60 Probes, 145579 Nodes, As polled 3.887 Gb; the latest entry was Tuesday at 8:30 a.m.; last rollup failed. I noted that the ingres errlog.log stops reporting on: Wed Feb 23 13:51:16. The system log shows:
Tuesday, February 29, 2000 12:05:39 AM Starting job 'Conversations Rollup' . . . (Job id: 100001, Process id: 7560).
Tuesday, February 29, 2000 12:05:52 AM Job step 'Conversations Rollup' failed (the error output was written to /opt/neth/log/Conversations_Rollup.100001.log Job id: 100001).
The log file does not specify the reason for the failure:
----- Job started by Scheduler at '02/29/2000 04:05:02 PM'.
-----
$NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME
-----
Scheduled Job ended at '02/29/2000 04:05:13 PM'.
-----
This customer's DB is getting huge (~4 Gb) and disk space is at about 92% utilization. Customer also has this type of message: Tuesday, February 29, 2000 09:34:45 AM Error (nhiPoller[Dlg]) Append to table nh_dlg0_951843599 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. A bunch of collected files are on: //voyagerii/Escalated Tickets/31000/31677/

5/5/2000 9:05:18 PM SysAdmin From the ingres error log:
NETWORK ::[59492 , 00001fc0]: Mon Feb 28 15:36:34 2000 E_CL1004_LK_DEADLOCK Deadlock detected
NETWORK ::[59492 , 00001fc0]: Mon Feb 28 15:36:34 2000 E_DM9045_TABLE_DEADLOCK Deadlock encountered locking table neth.nh_stats0_951767999 in database nethealth with mode 5. Resource held by session [27856 1fc0].
NETWORK ::[59492 , 00001fc0]: Mon Feb 28 15:36:34 2000 E_DM0042_DEADLOCK Resource deadlock.
NETWORK ::[59492 , 00001fc0]: Mon Feb 28 15:36:34 2000 E_QE002A_DEADLOCK Deadlock detected.
Customer is on P10 D07. I had him grep for ingres to see if there were multiple DBs running; there weren't.

5/5/2000 9:05:18 PM SysAdmin Customer would like an option that lets a user select whether the database backup does a full integrity check or does what it does now. He is asking because they have cases where the database backup log says the backup went through fine, and then a CRC test on the files, or loading the database onto a server, will sometimes fail. I explained to the customer that when the db is loaded, Nethealth does a full integrity check, which is why a restore takes so long; when the backup occurs, it does not do a full integrity check.
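The CRC test this customer runs by hand can be scripted. A minimal sketch, assuming nothing about the product's save format: compute a CRC-32 per saved file and compare against a recorded manifest, so a backup can be sanity-checked without waiting for a full load. The manifest shape and helper names here are illustrative, not part of Network Health.

```python
import zlib

def file_crc32(path, chunk=1 << 16):
    """Stream a file and return its CRC-32, without loading it all into memory."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def verify(manifest):
    """manifest: {path: expected_crc}; return the paths whose CRC no longer matches."""
    return [p for p, want in manifest.items() if file_crc32(p) != want]
```

Recording the manifest at save time and running `verify` before relying on the backup would catch the "backup log says fine, load fails" case described above.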
9/20/2002 8:17:02 AM lmcquade Denise Brooks requested the status be set from Archive to Closed per Peggy Kuehne's request.

5/5/2000 9:05:18 PM SysAdmin See esc. tickets directory for logs. Network Health prints the following error log after a statistics rollup:
> Job started by Scheduler at '03/02/2000 17:20:20'.
> $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME
> Begin processing (03/02/2000 17:20:20).
> Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Mar 2 12:20:29 2000)
See change log.

5/5/2000 9:05:18 PM SysAdmin Customer would like the -backup option from a command-line dbSave to be added as a check box in the GUI dbSave. WTM - Add this to 5.0.

9/1/2001 4:19:14 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:18 PM SysAdmin This is State Farm nh00c12. The rollup keeps failing with an error in the log stating: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Mar 2 17:32:26 2000). Beth Lord gave State Farm a CD equivalent to the 4.6h release. Customer says the problem is resolved.

8/22/2000 10:02:34 AM bhinkel Changed status to Closed, as the fix was included in a product release (4.6).

5/5/2000 9:05:18 PM SysAdmin NH version 4.5, statistics_rollup log: Begin processing (03/06/2000 07:00:43 PM). Error: Append to table nh_stats1_950331599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 94931 rows not copied because duplicate key detected.) No entries in the errlog.log file. Contents of the NH_HOME/tmp directory: nh_home_tmp_to_concord_0307.tar.Z. help\g output in the file: tables.out_0307.wri. nhiIndexDiag did not return any output.
All files are in the Escalated Tickets directory on voyagerII/30000/32155. Patch level 2; I am having them patch to 11. Please review info in change history. Closed as support closed the Call Ticket.

5/5/2000 9:05:19 PM SysAdmin Error from failed Statistics_Index and Statistics_Rollup logs: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
df -k output:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s0 3530595 432312 3062978 13% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t0d0s3 3530595 3060222 435068 88% /var
/dev/md/dsk/d2 17140706 14412440 2556859 85% /opt
swap 1576472 960 1575512 1% /tmp
NH version 4.5.1 with P11/D08, Solaris 2.6 on Sun Enterprise 450 with 512 MB RAM, 1.5 GB swap. Related files (errlog.log, syslog, failed Statistics_Index and Statistics_Rollup logs) are in Escalated Tickets/32000/32223.

5/5/2000 9:05:19 PM SysAdmin Statistics_Rollup log: Error: Append to table nh_stats1_951541199 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 101203 rows not copied because duplicate key detected.
Output of nhiIndexDiag command: Table is lacking an index. Duplicate problem: Found 0 duplicates out of 501 rows for index job_schedule_ix on table nh_job_schedule. Analysis of indexes on database 'nethealth' for user 'neth' completed successfully.
Ingres errlog.log:
Wed Feb 16 03:17:06 2000 E_SC0216_QEF_ERROR Error returned by QEF.
Wed Feb 16 03:17:06 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query.
Thu Feb 17 18:06:21 2000 E_CLFE06_BS_WRITE_ERR Write to peer process failed; it may have exited. System communication error: Broken pipe. (last message in this log)
System messages: multiple instances of "Warning (nhiPoller[Net]) The database space is getting low, you should make more space available."
Output of df -k command:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s6 2507805 21582 2436067 1% /
/dev/dsk/c0t0d0s3 1616095 399893 1167720 26% /usr
/dev/dsk/c0t0d0s4 963869 275788 630249 31% /var
/dev/dsk/c0t9d0s0 8705501 3748765 4869681 44% /copy
/dev/dsk/c0t0d0s0 1490275 240890 1189774 17% /home
/dev/dsk/c0t8d0s0 8705501 8055051 563395 94% /opt
/dev/dsk/c0t0d0s5 963869 9 906028 1% /stand
swap 1117936 548912 569024 50% /tmp
Note: all nethealth and db files are on /opt. Related files in Escalated Tickets directory /32000/32185.

5/5/2000 9:05:19 PM SysAdmin Error message in stats rollup log: Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Error in stats index log: Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. No duplicate errors in errlog.log. Database status dialog indicates rollups have failed. Customer has had this problem twice before - please see tickets 29217 and 31573. This seems to happen every few weeks at this site.

5/5/2000 9:05:19 PM SysAdmin Database is inconsistent after system crash. Excerpt from infodb.txt: The Database is Inconsistent. Cause of Inconsistency: UNDO_ERROR. The Database is not Journaled. Journals are not valid from any checkpoint.
Excerpt from errlog.log:
E_DM0152_DB_INCONSIST_DETAIL Database nethealth is inconsistent.
NETMGRN3::[II\INGRES\177 , 000001ae]: Fri Mar 10 10:14:27 2000 E_SC0121_DB_OPEN Error opening database. Name: nethealth Owner: nethealth Access Mode: 00000002 Flags 00000000
NETMGRN3::[II\INGRES\177 , 000001ae]: Fri Mar 10 10:14:27 2000 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: d:\nethealth\oping\ingres\data\default\nethealth Flags: 00000003
All related log files are on voyager_II in 32000/32286.

5/5/2000 9:05:20 PM SysAdmin Rollups have been failing. Attempted to clean out duplicate keys, but they still fail.
Today the DB went inconsistent. Patch 11. Sent him the clean script from Jay and we will see how this pans out.
Closing as the associated call ticket is closed.

5/5/2000 9:05:21 PM SysAdmin Excerpt from the call ticket: The customer just completed a load for the second time using a different 4.1u database and got the same problem. The poller config dialog is showing the speed in the index column and the name in the agent type column; the poller status stays yellow and any attempt to modify an element causes the GUI to hang. We are still seeing the following in the console GUI log:
Wednesday, 15/03/2000 17:05:35 System Event The server is not running, starting server . . .
Wednesday, 15/03/2000 17:05:46 System Event Console initialization complete.
Wednesday, 15/03/2000 17:05:47 Internal Error (Configuration Server) Expectation for '!_initCbObj && !_initCbRtn' failed (CdtDbElemTrans::initTrans in file ../CdtDbElemTrans.C, line 158). (cu/)
Related files on \\voyagerii\32365
manthony - 3/21/00 Received a flurry of misinformation from the reseller who is the go-between for Support and the customer. Apparently the reseller did NOT upgrade the remote polling installations; they ONLY upgraded the central site. Customer decided on Friday 3/17/00 to upgrade to 4.5.1 p10 instead of 4.6. Customer support is going to hold the reseller's hand to make sure that the upgrade is done properly. This issue is being closed.

5/5/2000 9:05:23 PM SysAdmin /opt/concord/nethealth/log/Conversations_Rollup.100001.log:
> $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME
> Begin processing (2000/17/03 08:05:40).
> Error: Unable to execute 'MODIFY nh_dlg1s_941266799 TO BTREE UNIQUE ON sample_time, dlg_src_id, nap_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Fri Mar 17 11:05:50 2000)).
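These duplicate-key failures all have the same shape: the rollup tries to build a unique BTREE index over a key like (sample_time, dlg_src_id, ...) while the table holds more than one row per key. A hedged sketch of what the clean-up scripts mentioned in these tickets evidently did (keep one row per key, set the rest aside for review); the row and column names are illustrative and assume nothing about the real table layout:

```python
def dedupe(rows, key_cols):
    """Keep the first row seen for each key tuple; return (kept, duplicates).

    rows: list of dicts; key_cols: the columns that the unique index would cover.
    """
    seen, kept, dups = set(), [], []
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        (dups if k in seen else kept).append(row)
        seen.add(k)
    return kept, dups

# Toy data mirroring nhiIndexDiag's "Found N duplicates out of M rows" report:
rows = [
    {"sample_time": 1, "src": "a", "v": 10},
    {"sample_time": 1, "src": "a", "v": 11},  # duplicate key (1, "a")
    {"sample_time": 2, "src": "a", "v": 12},
]
kept, dups = dedupe(rows, ["sample_time", "src"])
```

Once the duplicates are removed, the MODIFY ... TO BTREE UNIQUE (or the equivalent CREATE INDEX) can succeed.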
Rollups ran successfully at the customer site. Closed.

5/5/2000 9:05:23 PM SysAdmin Statistics_Rollup fails as follows: Begin processing (03/16/2000 08:00:18 PM). Error: Append to table nh_stats1_952754399 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.
Ingres errlog has deadlock errors (see escalated tickets/32000/32617): Fri Feb 25 02:21:12 2000 E_DM9045_TABLE_DEADLOCK Deadlock encountered locking table health.nh_stats0_951461999 in database nethealth with mode 5. Resource held by session [510 11f9].
Output of "nhiIndexDiag -u -d " shows duplicates - see escalated tickets/32000/32617. Customer is at patch level 11. Related files in escalated tickets: system messages, errlog, failed rollup log, nhiIndexDiag_output.
3/21/00 - manthony Customer's DB has been cleaned. Closing this issue.

5/5/2000 9:05:23 PM SysAdmin CbaBaseApp::invokePgm parses the command line from the database using chars. This causes Japanese character strings to get mangled and scheduled reports to fail. Don't parse the command line one char at a time; use wide strings.

5/5/2000 9:05:23 PM SysAdmin Rollups are failing due to duplicates: table "nh_stats1_940111199 failed". The following files are in Voyager_II esc. tickets 32000/32465: errlog.log, stats rollup log, everything in $NH_HOME/tmp, Statistics_Index.100005.log, Analysis.txt, the output of echo "help\g" | sql nethealth > tables.out.
Sent worksheet to Support, requested some additional info. They should be able to handle it from here. (Info is just for analysis.) Am closing as unable to get info from customer. Customer is back up and running. Ultimately, this bug will be handled by Mike A's fix to the rollup transaction logic, scheduled for 4.7.

5/5/2000 9:05:23 PM SysAdmin Rollup error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
Stats index error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
Output of indexDiag.out.txt: Index 'nh_stats0_951152399_ix1' was not in the database. Duplicate problem: Found 18 duplicates out of 27577 rows for index nh_stats0_951152399_ix1 on table nh_stats0_951152399.
Ingres errlog: E_QE0024_TRANSACTION_ABORTED The transaction log file is full. The transaction will be aborted.
See escalated tickets/32000/32692 for syslog, errlog, rollup log, stats_index log, data_analysis log, output of indexDiag.out, and output of the df -k command. Customer running 4.5 P10 on Solaris 2.5.1.
rlt. Asked for more data.

5/5/2000 9:05:23 PM SysAdmin Assigned. On Wednesday night, the rollup failed with the following error:
>Job started by Scheduler at '15/3/2000 22:00:59'.
>$NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME
>Begin processing (15/3/2000 22:01:00).
>Error: Unable to execute 'DROP TABLE nh_stats0_947761199' (E_US125C Deadlock detected, your single or multi-query transaction has been aborted. (Wed Mar 15 22:06:53 2000)).
>Scheduled Job ended at '15/3/2000 22:06:54'.
The result of this was that on Thursday night, the rollup failed again with the standard error we've been experiencing:
>Job started by Scheduler at '16/3/2000 22:00:09'.
>$NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME
>Begin processing (16/3/2000 22:00:11).
>Error: Append to table nh_stats1_947800799 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.).
>Scheduled Job ended at '16/3/2000 22:00:21'.
Related files are in \\voyagerii\32000\32178\March21
> Assigning to mike - jkelly 3/22/2000
I took it, my week. Was already working it. rlt. 3/23/00.

5/5/2000 9:05:24 PM SysAdmin iidbdb is inconsistent. - NH 4.5 P4/D03. - HP-UX. - Error messages from the errlog.log file:
E_SC0215_PSF_ERROR Error returned by PSF.
E_DM9331_DM2T_TBL_IDX_MISMATCH Base table missing or indices missing.
E_DM9C8B_DM2T_TBL_INFO An error occurred while attempting to build the Table Control Block for table (17825,0) in database nethealth.
E_DM9C89_DM2T_BUILD_TCB An error occurred while building a Table Control Block for a table.
E_DM9C8A_DM2T_FIX_TCB An error occurred while trying to locate and/or build the Table Control Block for a table.
E_RD0060_DMT_SHOW Cannot access table information due to a non-recoverable DMT_SHOW error
E_DM010B_ERROR_SHOWING_TABLE An error occurred while showing information about a table.
E_RD0060_DMT_SHOW Cannot access table information due to a non-recoverable DMT_SHOW error
E_PS0904_BAD_RDF_GETDESC RDF error occurred when getting description for an object.
E_PS0007_INT_OTHER_FAC_ERR PSF detected an internal error when calling other facility.
E_SC0215_PSF_ERROR Error returned by PSF.
E_DM9327_BAD_DB_OPEN_COUNT The database open count was greater than zero for the first opener of the database, therefore the database must be considered inconsistent. This can happen when the log file was not readable; therefore, if any transactions were in progress, they could not be backed out. This database must be rolled forward from a checkpoint, or destroyed and recreated.
E_DM0100_DB_INCONSISTENT Database is inconsistent.
E_DM0152_DB_INCONSIST_DETAIL Database iidbdb is inconsistent.
E_SC0121_DB_OPEN Error opening database.
Name: iidbdb Owner: $ingres Access Mode: 00000002 Flags 00000000
E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: /opt/local/nethealth/idb/ingres/data/default/iidbdb Flags: 00000003
E_US002B Could not open the iidbdb database.
E_DM0100_DB_INCONSISTENT Database is inconsistent.
E_US002B Could not open the iidbdb database.
E_SC0123_SESSION_INITIATE Error initiating session.
Gave Shane the worksheet. Reviewed what data had been collected. This looks like the 'hard shutdown, no other cause' category. As such, asked Shane to get some more data for me, as this is our last remaining target category. See change history.

5/5/2000 9:05:24 PM SysAdmin Conversations Rollup Failure. Error message from Conversation Rollup log: Begin processing (03/22/2000 06:20:45 PM). Error: Append to table nh_dlg1s_953096399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 4102 rows not copied because duplicate key detected.).
Asking customer for the following per Robin Trei's DB troubleshooting steps: all the logs in $II_SYSTEM/ingres/files/*.log, all the files in $NH_HOME/logs, and the system messages log.
Awaiting info. rlt. 2/23/00. See change history.
Talked with Bob. Everything looks good except that the customer thinks the nodes aren't being deleted at midnight because the numbers are growing. (NO errors in the log, though.) Asked Bob K to put a trace on it for tonight, while I investigate on this end.
This is Paine Webber.
Original problem solved long ago. They say that their midnight rollups are not deleting the nodes they are supposed to. Apparently, there are no error messages. (Asked for but never received any logs.) I've explained to Bob to pass along that the nh_node_addr_pairs are not deleted until they are not referenced in *any* of the dialog tables. The nh_node_addrs and nh_elements are not deleted until the nh_node_addr_pairs are deleted. I can't do anything more without a database.
5/10/2000 11:02:40 AM rtrei This does not require a code change.
5/11/2000 10:56:03 AM rkeville Config. - NH 4.6. - Server=concord7. - UE 250. - 2x400 MHz processors. - 512 MB RAM.
5/15/2000 3:57:27 PM rtrei Have database, investigating.
5/17/2000 9:17:46 AM rtrei Elena is correct, nodes aren't being dereferenced. At least, there are a huge number that should have been dereferenced and probably weren't, due to the earlier problems. If the dialog rollups fail (because of dups, etc.), it does not go back in time to delete any it might have missed. I will be submitting a remedy ticket to correct this in the future (patch or next release). In the interim, I am working on a script to clean this database up.
5/18/2000 6:33:27 PM smcafee Moved low-priority issues to postponed. Needs review for readme.
5/19/2000 8:17:32 AM smcafee Not a 4.7 issue. Should not have been postponed.
6/2/2000 12:26:24 PM rtrei Closing as the associated call ticket finally closed.

5/5/2000 9:05:24 PM SysAdmin Customer knows that you can find the location of the transaction log in install.cfg, but would like to see the location of the transaction log when running nhDbStatus AND in the GUI Database Status. Evaluate for 5.0 - imp500.
5/18/2000 5:18:00 PM lincoln changed status
5/26/2000 2:18:35 PM lincoln changed status
9/1/2001 4:19:16 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year.
If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:25 PM SysAdmin The customer would like to be able to use groups (and possibly groupLists) in the plannedDowntime config text file (in addition to the current per-element basis). This would let them tell Network Health that they have a service window for their entire network by adding one line for the ALL group, which would be MUCH easier than adding their 4000 elements to the file. Customer is using plannedDowntime to eliminate Health reports from reporting on Availability exceptions. Customer: SE-Bank. Reseller: Cygate. WTM - Evaluate for 5.0 - imp500.
5/18/2000 5:18:00 PM lincoln changed status
5/26/2000 2:18:36 PM lincoln changed status
7/30/2001 12:58:17 PM jkaufman Currently scheduled to be added in release 6.0.
9/1/2001 4:19:16 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.

5/5/2000 9:05:25 PM SysAdmin Rollups are failing due to duplicates: table "NH_stat1_951865199 failed". The following files are in Voyager_II esc. tickets 32000/32752: errlog.log, stats rollup log, everything in $NH_HOME/tmp, Analysis.txt, the output of echo "help\g" | sql nethealth > tables.out.
manthony - 4/14/00 Customer successfully able to run rollups now. Closing this issue.

5/5/2000 9:05:25 PM SysAdmin Statistics_Rollups log: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Conversations_Rollups log: Unable to execute 'MODIFY nh_dlg1s_944456399 TO BTREE UNIQUE ON sample_time, dlg_src_id, nap_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_US1591 MODIFY: table could not be modified because rows contain duplicate keys.
IndexDiag.out log: Problem encountered with analyzing table. Error: Indexing problem: nh_node_addr_pair should have been BTREE but was HEAP. Table is lacking an index. Duplicate problem: Found 43318 duplicates out of 1364461 rows for index nh_node_addr_pair_ix2 on table nh_node_addr_pair. Saving duplicates in '/opt/health/tmp/session.nh_node_addr_pair.dat'
Index_Statistics and Data_Analysis are fine; no errors in the Ingres errlog.log since Feb 18th. FetchDb is also failing. See Escalated Tickets\33000\33003 for related files. Note that all log files are in the NH_HOME_Logs.log file. Customer is running NH version 4.5.1 P08 on Solaris 2.6. Call ticket 33003.
manthony 4/26/00 Customer has not responded to this issue for almost a month due to a flurry of other problems he has been having with nethealth. We will need to basically start from scratch. Requesting network health log files. Awaiting feedback.
5/10/2000 10:53:57 AM manthony Customer has NOT responded in about a month to queries. Closing this issue.

5/5/2000 9:05:25 PM SysAdmin Customer noticed that rollups were failing. Statistics_Rollup.1000000.log shows:
Job started by Scheduler at '03/23/2000 12:00:40 AM'.
$NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME
Begin processing (03/23/2000 12:00:42 AM).
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Mar 23 00:06:50 2000)).
Scheduled Job ended at '03/23/2000 12:06:53 AM'.
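A triage aid worth noting: the numeric suffix on tables like nh_stats0_951767999 or nh_dlg1s_953096399 appears to be a Unix timestamp marking the end of the rollup window. That is an inference from how the suffixes line up with the ticket dates, not something these tickets document, but if it holds, decoding the suffix shows which period a failed table covers:

```python
from datetime import datetime, timezone

def table_window_end(table_name):
    """Decode the trailing epoch in an nh_*_<epoch> table name (assumed convention)
    and return it as a UTC datetime."""
    epoch = int(table_name.rsplit("_", 1)[-1])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)

end = table_window_end("nh_stats0_951767999")  # a table from the deadlock entries above
```

The suffixes ending in ...999 or ...599 would then be period boundaries minus one second, which is consistent with hourly/daily rollup windows.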
Ran cleanStats > dupInfo.out; found the following: nh_stats1_945061199, nh_stats2_945579599. Customer is running 4.5.1 P10 D7.
Files in the \\Voyageri\Escalated Tickets\32000\32834\ directory: 32834.log, system messages, dupInfo.out (cleanStats results), errlog.log (Ingres error log), nhiRollupDb_9157.txt (advanced logging; on his own he ran this for two days and got a 70 MB text file), Statistics_Rollup.100000.log (file showing the error message).
manthony 3/28/00 Asked support to get help\g output. Awaiting info.
manthony 3/29/00 Customer support sent the customer a script to clean the DB. The rollups are now working for the customer. Closing this issue.

5/5/2000 9:05:26 PM SysAdmin Statistics_Index and Statistics_Rollups fail with the same error message: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Data Analysis, DbSave and Conversations Rollups are fine. The output from the nhiIndexDiag command shows multiple duplicates in the database. Last entry in the ingres error log is: Fri Feb 11 07:44:40 2000 E_SC0129_SERVER_UP Ingres Release OI 2.0/9712 (su4.us5/00) Server -- Normal Startup.
System messages, df -k output, and log files are all in Escalated Tickets/32000/32946. Note: Ingres error log is forthcoming; see Health.1000069.log for the last 50 lines of the ingres errlog. Customer is running on Solaris 2.6, NH version 4.5 with P11 and D08. This is the central site of a distributed polling environment.
manthony 3/28/00 Asked customer support to run a script that will clean the DB, and asked them to drop a stats1 table, which will allow rollups to run. Also asked to get the customer to patch 12 to fix the poller crashing. Awaiting feedback.
manthony 3/31/00 Customer support stated that the customer's DB is cleaned. Closing this issue.

5/5/2000 9:05:26 PM SysAdmin Customer was having problems with disk space (see system messages in the Escalated Tickets directory).
He was trying to reclaim some space by changing the rollup schedule for daily data from 70 to 60 weeks and then running rollups manually. He gets the following error in Statistics_Rollup.1000000.log: Begin processing (03/28/00 09:55:00). Error: Append to table nh_stats1_953787599 failed (see the Ingres error log file for more information). He is running 4.1.5 P13 D11 on HP-UX with OS 10.20. Files in the \\Voyagerii\Escalated Tickets\33000\33075\ directory: sys_messages.txt, bdf.txt, discoverResults.log, errlog.log, Statistics_Rollup.100000.log. See change history.

5/5/2000 9:05:27 PM SysAdmin Statistics_Rollup: Error: Unable to execute 'DROP TABLE nh_stats0_953917199' (E_US125C Deadlock detected, your single or multi-query transaction has been aborted.
Ingres errlog.log:
Thu Mar 16 11:18:46 2000 E_DM0100_DB_INCONSISTENT Database is inconsistent.
Thu Mar 16 11:18:46 2000 E_DM0152_DB_INCONSIST_DETAIL Database nethealth is inconsistent.
Thu Mar 16 11:18:46 2000 E_SC0121_DB_OPEN Error opening database. Name: nethealth Owner: health Access Mode: 00000002 Flags 00000000
Thu Mar 16 11:18:46 2000 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: /usr2/concord/idb/ingres/data/default/nethealth Flags: 00000003
Output of df -k:
ndcnms1% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d2 48023 27628 15593 64% /
/dev/md/dsk/d14 480919 383736 49092 89% /usr
/dev/md/dsk/d8 192807 99012 74515 58% /var
/dev/md/dsk/d17 262159 14367 221577 7% /home
/dev/md/dsk/d11 480919 87458 345370 21% /opt
swap 371960 120 371840 1% /tmp
/dev/md/dsk/d0 7867296 5425197 1655370 77% /usr2 - this is where nethealth lives
Notes: Stats_Index completes fine. Data_Analysis completes fine. All related files, including the output of nhiIndexDiag, are in Escalated Tickets\33000\33133. Customer is running NH 4.5.1 with P11/D08 on Solaris 2.6 - a remote poller in a distributed polling environment.
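Several of these tickets come down to a filesystem quietly filling up under the database until appends start failing with "disk full" symptoms. A small sketch that flags high-capacity mounts from `df -k` text before rollups start failing; the output format is assumed to match the excerpts above (header line, capacity in the fifth column):

```python
def full_filesystems(df_output, threshold=85):
    """Parse `df -k` text and return (mount, capacity%) pairs at/above threshold."""
    hot = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # skip malformed or wrapped lines
        pct = int(fields[4].rstrip("%"))
        if pct >= threshold:
            hot.append((fields[5], pct))
    return hot

sample = """Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t8d0s0 8705501 8055051 563395 94% /opt
swap 1117936 548912 569024 50% /tmp"""
hot = full_filesystems(sample)
```

Run nightly before the scheduled rollups, this would have flagged the 94%-full /opt in the ticket above well before the E_CO0048 "disk full" copy failures.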
NOTE: SEE THE CHANGE HISTORY SECTION
5/17/2000 9:47:36 AM rtrei Closing as the call ticket is closed.

5/5/2000 9:05:28 PM SysAdmin Rollups are failing due to duplicates: table nh_dlg1s_953701199. The following files are in Voyager_II esc. tickets 33000/33200: errlog.log, stats rollup log, everything in $NH_HOME/tmp, Statistics_Index.100005.log, Analysis.txt, the output of echo "help\g" | sql nethealth > tables.out.
manthony 3/31/00 Getting an NT version of the nhiDialogRollup built.
manthony 4/3/00 Testing the nhiDialogRollup.exe on test data we have in house revealed that the recursion is blowing the stack. Requested the customer's database to make sure that the rollups will succeed.
manthony 4/7/00 Customer reported that rollups are now working fine. Closing this issue.

5/5/2000 9:05:28 PM SysAdmin Statistics_Rollup log: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
Ingres errlog, TCONOC13:
Mon Mar 20 15:43:17 2000 E_US0014 Database not available at this time. The database may be marked inoperable. This can occur if CREATEDB failed. An exclusive database lock may be held by another session. The database may be open by an exclusive (/SOLE) DBMS server.
Mon Mar 20 15:43:17 2000 E_SC0123_SESSION_INITIATE Error initiating session.
Tue Mar 21 17:00:39 2000 E_SC0216_QEF_ERROR Error returned by QEF.
Tue Mar 21 17:00:40 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages can be found in the error log.
Mon Mar 27 14:36:00 2000 E_CLFE06_BS_WRITE_ERR Write to peer process failed; it may have exited.
Mon Mar 27 14:36:01 2000 E_QS001E_ORPHANED_OBJ An orphaned Query Plan object was found and destroyed during QSF session exit
Mon Mar 27 14:36:01 2000 E_QS0014_EXLOCK QSF Object is already locked exclusively.
System log, ingres errlog, and failed rollup log in Escalated Tickets\33000\33152. Customer running NH 4.5.1 with P12/D08 on Solaris 2.6.
Read the call log. Not sure exactly what has been done. Sent mail to support.
Closing as the associated call ticket is closed.

5/5/2000 9:05:29 PM SysAdmin Customer is running 4.5 with patch 10. Statistics rollups are failing: Error: Append to table nh_stats1_952405199 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 10 rows not copied because duplicate key detected.)
From the Statistics Index log: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.)
No entries in the ingres error log file. Errlog.log, stats rollup/index logs, and syslog are on voyagerii/tickets/33000/33163. See change history.

5/5/2000 9:05:29 PM SysAdmin From: Paul Ratajczyk [mailto:pratajczyk@concord.com] Sent: Thursday, March 30, 2000 11:59 PM To: support@concord.com Cc: 'Black, Darrell (NWA)'; 'Worthley, Cheryl' Subject: Enhancement Request
Hello, During a conversation with Northwest Airlines, they conveyed some enhancement possibilities for future releases. Could you please enter these requests and provide a tracking number.
3. Creating the system log as a file, then parsing it off and saving it for the last 7 days, in a rolling cycle, would be advantageous for troubleshooting in the future.
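The rolling 7-day system log requested here is a standard pattern today; a sketch of the idea using Python's standard library (the log file path and logger name are illustrative, not the product's), rotating at midnight and keeping seven old files:

```python
import logging
import logging.handlers
import os
import tempfile

# Rotate at midnight and keep the last 7 days of logs (path is illustrative).
logfile = os.path.join(tempfile.gettempdir(), "system_messages.log")
handler = logging.handlers.TimedRotatingFileHandler(
    logfile, when="midnight", backupCount=7)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("sysmessages")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("Starting job 'Conversations Rollup' ...")
```

Old files get a date suffix and anything older than `backupCount` rotations is deleted automatically, which is exactly the "parse it off and save the last 7 days" behavior the customer describes.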
5/17/2000 1:39:44 PM lincoln (imp500)
5/17/2000 2:04:52 PM lincoln changed status
5/18/2000 5:18:01 PM lincoln changed status
5/26/2000 2:18:36 PM lincoln changed status
7/30/2001 12:59:22 PM jkaufman Implemented in the 5.0 release.

5/5/2000 9:05:29 PM SysAdmin ICS would like any tables that are not completely copied during a failed rollup to be cleaned automatically. They feel the rollup should go in, see if any of the tables are not completely rolled up, drop them, then proceed with the next rollup.
5/23/2000 10:09:43 AM rtrei With the change to 4.7, each table rollup is handled in a single transaction: the table is copied, indexed, and the old table dropped as a single transaction. It will either all complete or all roll back. If it rolls back, everything is set to restart once rollups happen again. Sorry for the use of rollback and rollups in the same sentence. I am marking this as a repeat. This work was done by Mike A on one of his tickets. It already had the required checkin information, etc. done.

5/5/2000 9:05:29 PM SysAdmin nhiLoadDb fails with message: E_CO0039 Error processing row 1. Cannot convert column 'poll_rate' to tuple format.
- NH 4.6 P02/D01.
- Solaris 2.5.1.
- Upgraded from running 4.5 install.
- After the upgrade failed, had the customer reinstall Ingres and attempt to load a saved database.
- Excerpt of error messages from load.log:
Loading table nh_hourly_health . . . Loading the sample data . . . Updating a prior version 4.5 database . . . Begin processing Copying new fields into the Database (03/31/2000 20:07:03).
Clearing NMS key field on at-gw1-rcn-RH-Cpu-1-- length was too long. Please rediscover.
Clearing NMS key field on PL-O-AT-CKSONU-3-- length was too long. Please rediscover.
Clearing NMS key field on PL44903384-O-AT-OCMBOSES-1-- length was too long. Please rediscover.
Fatal database error: Step 8 in rev 22 31-Mar-2000 20:08:26 - Database error: -33000, E_CO0039 COPY: Error processing row 1.
Cannot convert column 'poll_rate' to tuple format. Load of database 'nethealth' for user 'neth' was unsuccessful. End processing (03/31/2000 20:12:08).
Internal Error: Expectation for 'sqlca.sqlcode == 0' failed (disconnecting from the database in file ./duDatabaseSql.C, line 99). (cu/cuAssert)
=================================================================
see change history..
1hA This was due to a '|' character being in the community string. It turns out this is a legal character for community strings. The specifics of this problem were fixed by the recent changes to nh_convert_db. However, we still need to investigate whether ASCII saves allow one to designate the character before closing this completely. Closing: if this problem occurs in any existing databases, the customer can designate a different delimiter to use for the save and load by setting NH_DB_ASCII_DELIM.

5/5/2000 9:05:30 PM SysAdmin Error in statistics rollup log:
Error: Append to table nh_stats1_953848799 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 67128 rows not copied because duplicate key detected.).
--------------------------
Stats rollup log, stats index log, errlog.log, and console messages are in voyagerii\tickets\33000\33356. Customer is on Patch 10, dev cert 8, and running Solaris 2.6.
Asked Sheldon for the timezone to correlate errlog with system console messages. Looks like a straightforward stats1 append problem. Need to identify the reason why rollups failed.
1hA Closing this ticket, associated call ticket is closed.

5/5/2000 9:05:31 PM SysAdmin Excerpt from call ticket:
NOTE: This is not a regular "duplicates in a table" problem. This is connected to a severe problem in the Rollup mechanism. Please forward this problem to engineering ASAP.
NOTE2: We've fixed the problem at the customer's site by dropping 24 stats0 tables, but since this problem has occurred for the second time at this customer's site and also at other customers' sites, we need to get a fix for whatever is causing this problem.
PROBLEM: The daily statistics rollup job fails with error message: Error: Append to table nh_stats1_951519599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 5262 rows not copied because duplicate key detected. ).
Analysis has shown that the database contained several time-overlapped stats tables; e.g. the database contained both the table nh_stats1_951519599 (hourly samples for 02/25/2000) as well as 24 tables nh_stats0_951436799 (as-polled samples for 02/25/2000 00:00:00-00:59:59) [...] nh_stats0_951519599 (as-polled samples for 02/25/2000 23:00:00-23:59:59), each of them containing sample data (i.e. the time overlaps span all day on February 25th). Obviously, a previous rollup job left the database in this inconsistent state for some reason, and all following rollup jobs failed to correct this inconsistency.
This problem has occurred at this customer's site for the second time, and it has happened at other customers' sites, too. We need a fix for this soon! We've uploaded an archive to ftp://ftp.concord.com/incoming/ics001162-1.tar which contains the rollup log, a list of overlapping RLP entries (as well as the script that we've used to find them), and ASCII dumps (using select, not copy) of iitables, nh_rlp_boundary, and all the statistics tables that were mentioned above.
NOTE3: After verifying that the stats1 table contains *all* rolled-up data for 02/25/2000, we removed the stats0 tables from the database and the RLP table. A subsequent rollup was successful - but we still need a fix for the original problem, so that inconsistencies like this will never show up again.
Related files are on voyagerii\33000\33192. SWC Will the procedure in the Database support document apply in this case?
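The overlap described here can be detected mechanically, because each stats table name ends in the epoch second of its last sample: reading the names above, nh_stats1_T appears to cover the day ending at T and nh_stats0_t the hour ending at t. A sketch of that check (the window arithmetic is our inference from the table names in this ticket, not a product specification):

```python
DAY = 86400
HOUR = 3600

def boundary(table):
    """Epoch-second boundary encoded in a stats table name suffix."""
    return int(table.rsplit("_", 1)[1])

def redundant_stats0(stats1_table, stats0_tables):
    """Return the stats0 tables whose hourly window falls inside the day
    already covered by the given stats1 table -- the overlapping tables
    that make a later rollup append fail on duplicate keys."""
    end = boundary(stats1_table)
    return [t for t in stats0_tables
            if end - DAY + HOUR <= boundary(t) <= end]
```

Run against this ticket's tables, nh_stats1_951519599 flags all 24 nh_stats0 tables from 951436799 through 951519599, the same 24 tables that were dropped by hand.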
see change history...
1hA Closed ticket as call ticket closed.

5/5/2000 9:05:32 PM SysAdmin The Statistics Rollup job fails on the Central server with the following error message:
Begin processing (04/04/2000 23:28:20). Error: Append to table nh_stats1_951688799 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 276 rows not copied because duplicate key detected.
ALSO: When nhFetchDb runs, the import of the data to the database fails. (See the attached nhFetchDb log files.) When the import fails, it corrupts the poller configuration, which causes the NH server to stop. The error messages are as follows:
Fatal Internal Error ( Configuration Server ) File /opt/nethealth/poller/poller.cfg, Line 8894: Found 'EOL' where a '}' is expected. (cdm/)
The import of the data fails mainly due to duplicate keys (again, look at the nhFetchDb log files):
Checking for duplicate element names and inserting elements ...
************************************************************
* ERROR: Duplicate element ID.
*-----------------------------------------------------------
* Importing failed due to a duplicate element ID.
* Call technical support.
************************************************************
Error: Unable to execute 'INSERT INTO nh_element SELECT * FROM nht_dst_element' (E_US1194 Duplicate key on INSERT detected. (Wed Apr 5 06:15:47 2000) ).
Adding remote element association, element alias and latency data ...
Adding element data from files lat_b41/nea_b23/els_b45/mtf_b45 ...
INGRES TERMINAL MONITOR Copyright (c) 1981, 1997 Computer Associates Intl, Inc.
(0 rows)
E_CO003F COPY: Warning: 7167 rows not copied because duplicate key detected.
E_CO0028 COPY: Warning: Copy completed with 1 warnings. 6551 rows successfully copied.
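The E_US1194 failure above is a plain INSERT ... SELECT aborting on the first colliding element ID. A sketch of first listing the collisions and then merging only the non-colliding rows, using sqlite3 as a stand-in for Ingres (the schema and column names here are illustrative, not the product's):

```python
import sqlite3

# nh_element receives rows from a staging table (nht_dst_element in the
# log above). Listing the colliding IDs first shows which rows E_US1194
# would complain about; the guarded INSERT then merges the rest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nh_element (element_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE nht_dst_element (element_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO nh_element VALUES (?, ?)",
                 [(1, "router-a"), (2, "router-b")])
conn.executemany("INSERT INTO nht_dst_element VALUES (?, ?)",
                 [(2, "router-b"), (3, "switch-c")])

dups = conn.execute(
    "SELECT d.element_id FROM nht_dst_element d "
    "JOIN nh_element e ON e.element_id = d.element_id").fetchall()
print("colliding IDs:", [r[0] for r in dups])

# Merge only rows whose element_id is not already present.
conn.execute(
    "INSERT INTO nh_element SELECT * FROM nht_dst_element d "
    "WHERE NOT EXISTS (SELECT 1 FROM nh_element e "
    "WHERE e.element_id = d.element_id)")
print(conn.execute("SELECT COUNT(*) FROM nh_element").fetchone()[0])
```

Whether skipping colliding rows is actually safe depends on which copy of the element is authoritative; the point of the sketch is only that the collision set can be computed before the insert aborts the whole import.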
See the entire log directory from the central machine in \\Voyagerii\Escalated Tickets\33000\33480. See the remote save logs from the remote machines in the \\Voyagerii\Escalated Tickets\33000\33480\Remote\remotePoller\nethealth\ directory.
The stats1 problem has been fixed. The split/merge problem was a no-bug. That is, it was caused by user error. (They gave 2 different sites the same system id number, which is not allowed in the documentation.)
6/1/2000 11:06:58 AM schapman
-----Original Message-----
From: Pieter Becker [mailto:pieter@snscon.co.za]
Sent: Thursday, June 01, 2000 9:55 AM
To: Support List (E-mail)
Subject: Ticket 33480
Sheldon, hi The Statistics rollup job failed again. Here are all the output files. We did run the cleanStats script, but to no avail. Thanks
The related files are on voyagerii\33000\33480\June1. Let me know if you need the 'help' output.
6/2/2000 8:41:37 AM schapman Statistics rollup log:
Job started by Scheduler at '01/06/2000 01:17:29'.
----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (01/06/2000 01:17:31). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed May 31 19:31:11 2000) ).
----- Scheduled Job ended at '01/06/2000 01:31:43'. -----
Output from Ingres error log:
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:38 2000 E_SC0107_BAD_SIZE_EXPAND Error expanding virtual size of server.
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:38 2000 E_SC0107_BAD_SIZE_EXPAND Error expanding virtual size of server. brk() failed with operating system error 12 (Not enough space)
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:38 2000 E_SC0107_BAD_SIZE_EXPAND Error expanding virtual size of server.
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:44 2000 E_SC0107_BAD_SIZE_EXPAND Error expanding virtual size of server.
brk() failed with operating system error 12 (Not enough space)
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:44 2000 E_SC0204_MEMORY_ALLOC Error allocating memory.
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:44 2000 E_SC0107_BAD_SIZE_EXPAND Error expanding virtual size of server.
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:44 2000 E_SC0204_MEMORY_ALLOC Error allocating memory.
TCENH019::[1045 , 4075d500]: Mon May 29 03:20:44 2000 E_SC0123_SESSION_INITIATE Error initiating session.
TCENH019::[1045 , 401c0620]: Mon May 29 03:22:40 2000 E_SC0271_EVENT_THREAD The SCF alert subsystem event thread has been altered. The operation code is 0 (0 = REMOVE, 1 = ADD, 2 = MODIFY).
TCENH019::[1045 , 400d71f0]: Mon May 29 03:22:41 2000 E_CL0621_DILRU_CLOSE_ERR Error closing a DI file in the DIlru file cache Error freeing a reserved event control block
TCENH019::[1045 , 400d71f0]: Mon May 29 03:22:41 2000 E_SC0235_AVERAGE_ROWS On 42474. select/retrieve statements, the average row count returned was 70.
TCENH019::[1045 , 400d71f0]: Mon May 29 03:22:41 2000 E_SC0128_SERVER_DOWN Server Normal Shutdown.
TCENH019::[1045 , 400d71f0]: Mon May 29 03:22:41 2000 E_CL2518_CS_NORMAL_SHUTDOWN The Server has terminated normally.
::[II_ACP , 4013e040]: Mon May 29 03:22:47 2000 E_DM9815_ARCH_SHUTDOWN Archiver was told to shut down.
tcenh019::[1040 IIGCN, 00000000]: Mon May 29 03:22:47 2000 E_GC0152_GCN_SHUTDOWN Name Server normal shutdown.
::[LOGSTAT , 40114040]: Mon May 29 03:22:49 2000 E_CL2537_CS_MAP_SSEG_FAIL Failure in attempt to map the UNIX system segment (shared memory)
6/5/2000 9:53:51 AM rtrei Reassigning to Yulun while Ekta is away. Yulun-- even though this is not escalated, it is important, and we need to track that it gets resolved and help if needed.
6/13/2000 1:17:00 PM yzhang Hi Sheldon, From the call log I noticed you did a lot of work on this problem. Looks like you are now waiting for the rollup result from the customer. This problem ticket was also assigned to me.
Can you also let me know the customer rollup result when it's available. Thanks Yulun (X4524)
7/5/2000 2:36:27 PM yzhang Can you have the customer do the following:
1) run the cleanStats.sh again
2) drop table nh_stats1_954712799
3) finally run the rollup again
7/6/2000 5:34:38 PM yzhang Can you send me the following two tables with data so I can load the tables into my database for further investigation. Also can you tell me the database version. nh_stats0_954712799 nh_stats0_954755999 Thanks Yulun
7/7/2000 8:43:52 AM yzhang Sheldon, I looked at the information you put in the call ticket. I think I need the nethealth version (nh45, nh46 or nh47) the customer is running. Thanks Yulun
7/14/2000 7:34:45 AM schapman
-----Original Message-----
From: Chapman, Sheldon
Sent: Friday, July 14, 2000 7:26 AM
To: Zhang, Yulun
Subject: RE: FW: Ticket 33480
Yulun, The requested files are on \\voyagerii\tickets\33000\33480\July14 Sheldon
7/19/2000 8:45:38 AM yzhang Sheldon, Thanks for the two .txt files. In order for me to load the files into the tables quickly, I would like to ask you to request that the customer send the files again in .dat format. You can have the customer do the following at the Unix prompt, then ftp the two .dat files to you.
echo "copy table nh_stats0_954712799 () into 'nh_stats0_954712799.dat' \g" | sql nethealth
echo "copy table nh_stats0_954755999 () into 'nh_stats0_954755999.dat' \g" | sql nethealth
Thanks Yulun
7/19/2000 1:43:36 PM yzhang The ticket is closed.

5/5/2000 9:05:32 PM SysAdmin The customer was told that NH4.6 would solve his DA and DAC issues. It only moves the table size issue to another table. Concord as a company says we can support up to 80K elements. At this number of elements, the number of service profiles allowed before encountering the 2 gig limit on DAC tables is not sufficient to run a business supporting multiple customers with different service-profile needs.
This issue will happen more and more as our ISP/Telco customers increase the number of elements on their systems. The current customer is now going to pull his 1.5 million-dollar order if this is not fixed so it scales to what Concord says can be supported. Don
Customer is only trying to use 16 DACs on 10K elements and we are unable to accommodate this configuration. Customer had a data analysis failure with the following error:
Error: Append to table nh_daily_symbol failed, see the Ingres error log file for more information (E_QE007D Error trying to put a record. (Wed Apr 5 19:33:08 2000) ).
_______________________________________________________________
Error from the Ingres error log shows:
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM92CB_DM1P_ERROR_INFO An error occurred while using the Space Management Scheme on table: nh_daily_symbol, database: nethealth
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM92CF_DM2F_GALLOC_ERROR Error allocating space in physical file(s) for database table.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM92CB_DM1P_ERROR_INFO An error occurred while using the Space Management Scheme on table: nh_daily_symbol, database: nethealth
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM92DC_DM1P_EXTEND Error extending a file.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM92D0_DM1P_GETFREE Error allocating a page from free list.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM9257_DM1B_FINDDATA Error occurred finding disk space for a new record.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM925E_DM1B_ALLOCATE Error occurred allocating disk space for a new record.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM904D_ERROR_PUTTING_RECORD Error putting a record to database:nethealth, owner:nh, table:nh_daily_symbol.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_DM008B_ERROR_PUTTING_RECORD Error trying to put a record.
APHRODIT::[33145 , 00000119]: Wed Apr 5 19:33:05 2000 E_QE007D_ERROR_PUTTING_RECORD Error trying to put a record.
APHRODIT::[33145 , 00000174]: Thu Apr 6 10:29:35 2000 E_CL0608_DI_BADEXTEND Error allocating disk space write() failed with operating system error 27 (File too large)
_______________________________________________________________
df -k output shows plenty of disk space available:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s0 1016207 18879 936356 2% /
/dev/dsk/c0t0d0s1 1016207 389854 565381 41% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t0d0s3 492807 40219 403308 10% /var
/dev/dsk/c0t0d0s5 739542 95 680284 1% /export
/dev/dsk/c0t1d0s0 16970660 11589766 5211188 69% /opt
/dev/dsk/c0t0d0s6 476527 9820 419055 3% /usr/local
swap 1376016 20400 1355616 2% /tmp
Customer has applied Patch 2 and Cert 2. see change history
5/10/2000 11:01:50 AM rtrei This is scheduled for 4.7.1
7/20/2000 3:46:59 PM rtrei The statement that this was scheduled for 4.7.1 was made before the releases got renamed. At that time, 4.7.1 was what is now called 4.8. This bug is being resolved by the sushi-DAC project. It cannot be released as a patch because it has significant schema changes. The work is well started and will have no problems meeting its 4.8 deadline. I do not know why this has stayed an escalated issue, but there is little more I can say about it until it is marked fixed. If for some reason it does not make its 4.8 schedule I will add something more to the schedule, but given its current state, it is hard to conceive why that would happen.
11/13/2000 11:42:40 AM rtrei The sushi-dac project fixed this, and the code is set for release in 4.8.

5/5/2000 9:05:33 PM SysAdmin Rollups failing on Solaris 2.6, NH version 4.5.1 P12 D08. Excerpt from Ingres error log:
:[32800 , 00000016]: Fri Mar 31 16:15:59 2000 E_US0014 Database not available at this time.
o The database may be marked inoperable. This can occur if CREATEDB failed.
o An exclusive database lock may be held by another session.
o The database may be open by an exclusive (/SOLE) DBMS server.
________________________________________________________________
Excerpt from stats rollup log:
Begin processing (04/06/2000 17:15:58). Error: Append to table nh_stats1_954633599 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.
_________________________________________________________________
df -k output showing disk utilization is not a problem (note nethealth is under the /opt dir):
stats% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s0 447367 33464 369167 9% /
/dev/dsk/c0t0d0s1 515599 404274 59766 88% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t0d0s4 1808767 1445752 308752 83% /export
/dev/dsk/c1t5d0s0 52128848 5835512 45772048 12% /opt
swap 1517032 32 1517000 1% /tmp
stats%
_________________________________________________________________
Received list of all tables in the Db and nhiIndexDiag command output (all files on voyager).
_________________________________________________________________
see change history
1hA Closing as associated call ticket is closed.

5/5/2000 9:05:34 PM SysAdmin If the checkpoint location has not already been specified (using nhmvCkpLocation), the user should be able to specify this when doing a DB save in the GUI. "the problem with this is that the user doesn't know. they will go ahead and do a db save with checkpoints and find later that something is missing. the error message is what points them in the right direction. why can't we just do it within the db save location."
5/18/2000 4:57:06 PM lincoln imp500
5/18/2000 4:57:39 PM lincoln changed status
5/18/2000 5:18:02 PM lincoln changed status
5/26/2000 2:18:38 PM lincoln changed status
9/1/2001 4:19:18 AM AR_ESCALATOR Administrative change.
This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request. 5/5/2000 9:05:35 PM SysAdmin Failed Statistics rollup: $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (04/12/2000 00:25:55). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Failed Data Analysis: $NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME Begin processing (04/12/2000 01:01:58). Warning: Job 1000133 'Health' is incorrectly defined (no elements in 'West_Point_Modems' apply to this report). Warning: Job 1000181 'Health' is incorrectly defined (no elements in 'US_Toll_Free_Modems' apply to this report). Warning: Job 1000182 'Health' is incorrectly defined (no elements in 'US_Toll_Free_Modems' apply to this report). Warning: Job 1000183 'Health' is incorrectly defined (no elements in 'US_Toll_Free_Modems' apply to this report). Warning: All report jobs were successfully analyzed except those listed above. End processing (04/12/2000 01:53:55). Ingres errlog: Tue Apr 11 07:02:14 2000 E_SC0216_QEF_ERROR Error returned by QEF. Tue Apr 11 07:02:14 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. nhiIndexDiag.out (one of several duplicate messages): Problem with Index nh_stats1_939772799_ix1. Error: Index 'nh_stats1_939772799_ix1' was not not in the database. Duplicate problem: Found 107927 duplicates out of 215874 rows for index nh_stats1_939772799_ix1 on table nh_stats1_939772799. Saving duplicates in '/opt/health/tmp/session.nh_stats1_939772799.dat'. 
Note that session.nh_stats files came across corrupted All related files in Escalated tickets/33000/33740 including system messages, all log files and temp files Note - customer running 4.5.1 with P12 on HPUX 10.20 manthony - 4/18/00 Asked customer support to: (1) Run the script that will clean bad availability backfill. (2) drop table nh_stats1_954201599 (3) run the rollups. 5/8/2000 4:46:45 PM manthony Customer Reports all is well. Closing... 5/5/2000 9:05:35 PM SysAdmin Statistics_Rollup log: Begin processing (10/4/2000 10:10:23). Error: Append to table nh_stats1_951605999 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. Ingres errlog.log: Mon Apr 03 08:20:53 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association Fri Apr 07 09:00:14 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association Sun Apr 09 08:13:21 2000 E_DM9044_ESCALATE_DEADLOCK Deadlock encountered while escalating to table level locking on table iiprotect in database nethealth with mode 5. Sun Apr 09 08:13:21 2000 E_DM0042_DEADLOCK Resource deadlock. Sun Apr 09 08:13:21 2000 E_QE002A_DEADLOCK Deadlock detected. Mon Apr 10 08:20:49 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association Mon Apr 10 08:23:19 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association nhiIndexDiag was run but created an empty file. Having customer run this again. Still waiting for df -k and help\g | sql output. Data_Analysis and Statistics_Index were fine. All files in Escalated tickets/33000/33774 Customer running Nh4.5.1 with P11/D07 on NT 4.0 manthony 4/18/00 Requested a list of tables in the DB. Looked at tables and found that there are 17 stats0 tables corresponding to the stats1 table that rollups is failing on. 
This means that rollups probably failed while deleting the stats0 tables. To verify this I will ask support to run the following query: select min(sample_time), max(sample_time) from nh_stats1_951605999\g
manthony 5/5/00 Customer problem is fixed. Closing issue.

5/5/2000 9:05:35 PM SysAdmin Error received in Statistics Rollup and Index logs:
$NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME
Begin processing (04/13/2000 11:00:10 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Apr 13 23:34:46 2000)).
--------------------------
Customer is running 4.5.1 P11 D8 on Solaris 2.6. Stats rollup/index and sys messages are in voyagerii/tickets/33000/33833. errlog.log is on a tape that eng is currently looking at on ticket 31471.
manthony 4/18/00 Requested that customer support: (1) run the script that removes bad availability backfill. (2) select min(sample_time), max(sample_time) from nh_stats2_950417999;\g Then we can determine the next step.
jpoblete 4/24/00 Got the info requested by Mike.
manthony 4/25/00 Customer has now made it past the original point of failure. There is now a new point of failure. Asked customer support to run some queries to determine the next step.
jpoblete 4/25/00
echo "select count (*) from nh_stats0_954662399 where delta_time > 0 \g" | sql nethealth
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Tue Apr 25 10:24:36 2000 continue * * * * * * * * * * * * * * * * * * * * * Ingres Version OI 2.0/9712 (su4.us5/00) logout Tue Apr 25 10:24:36 2000
echo "select count (*) from nh_stats0_954662399\g" | sql nethealth
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Tue Apr 25 10:24:12 2000 continue * Executing . . .
+-------------+ |col1 | +-------------+ | 36019| +-------------+ (1 row) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Tue Apr 25 10:24:12 2000 echo "select min(sample_time), max(sample_time) from nh_stats1_954734399\g" | sql nethealth INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Tue Apr 25 10:23:15 2000 continue * Executing . . . +-------------+-------------+ |col1 |col2 | +-------------+-------------+ | 954654948| 954734399| +-------------+-------------+ (1 row) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Tue Apr 25 10:23:15 2000 manthony 4/26/00 Asked customer support to drop 22 stats0 tables that have already been rolled up. The next step is to run the rollups and they should succeed. Awaiting customer feedback. manthony 4/28/00 Customer reported all is well. Closing issue. ) 5/5/2000 9:05:35 PM SysAdmin Excerpt from errlog.log: NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM93AF_BAD_PAGE_CNT Wrong number of pages read or written. At page 1448, an attempt was made to read or write 8 pages, but only 0 pages were actually processed. read() failed with operating system error 0 (Error 0) NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM9005_BAD_FILE_READ Disk file read error on database:nethealth table:nh_stats2_934693199 pathname:/usr/db/nethealth/ingres/data/default/nethealth filename:aaaabpob.t00 page:1448 read() failed with operating system error 0 (Error 0) NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM9335_DM2F_ENDFILE End of file was reached during a read or write operation. NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM920D_BM_BAD_GROUP_FAULTPAGE Error faulting a group of pages. 
NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM904C_ERROR_GETTING_RECORD Error getting a record from database:nethealth, owner:health, table:nh_stats2_934693199. NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_DM008A_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_QE007C_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_SC0216_QEF_ERROR Error returned by QEF. NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_QS001E_ORPHANED_OBJ An orphaned Query Plan object was found and destroyed during QSF session exit. NH00C7 ::[1051 , 407f45e0]: Mon Apr 10 19:35:42 2000 E_QS0014_EXLOCK QSF Object is already locked exclusively. The entire file is located in \\Voyagerii\Escalated Tickets\33000\33887\errlog.log manthony 4/20/00 This error is harmless. The patch script runs nhConvertDb, but the DB did not need conversion. The patch script then tries to SQL into the DB to verify the schema rev., and that is what failed. The ingres error log contains an unusual amount of deadlock, QEF, and lock timeout errors. 
I am concerned that the customer is running home-grown apps against our data, which is causing these problems. I asked the customer for a list of scheduled jobs.
manthony - 4/24/00 Call ticket has been closed. Customer support claims that there were some directory permission problems that may have caused the error message to be reported during the patch upgrade. Closing this issue.

5/5/2000 9:05:35 PM SysAdmin Statistics rollups are failing with the following error:
----- Job started by Scheduler at '04/17/2000 08:00:49 PM'. -----
----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (04/17/2000 08:00:50 PM). Error: Append to table nh_stats1_954392399 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. ).
----- Scheduled Job ended at '04/17/2000 08:02:30 PM'. -----
These are the results of the df -k command:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s0 4670652 559776 4064170 13% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
/dev/dsk/c0t3d0s6 192833 96622 96019 51% /var
/dev/dsk/c0t2d0s4 8705501 1239999 7378447 15% /export/home
/dev/dsk/c0t3d0s5 8065210 9 7984549 1% /misc
/dev/dsk/c0t1d0s3 8705501 6910697 1707749 81% /opt
/dev/dsk/c1t4d0s3 8705501 749449 7868997 9% /opt1
/dev/dsk/c1t5d0s6 8705501 9 8618437 1% /opt2
/dev/dsk/c1t6d0s4 8705501 2324890 6293556 27% /opt3
/dev/dsk/c1t7d0s5 8705501 390477 8227969 5% /opt4
swap 3799832 38840 3760992 2% /tmp
The following files can be found in the \\Voyager\Escalated Tickets\33000\33914\ directory: df-k.txt, opt/nethealth/idb/ingres/files/*.log (all log files in the 'files' directory), Statistics_Rollup.100000.log, syslog0418.txt (system messages).
Customer is running 4.5.1 P11 on Solaris 2.6. Steph N said to close this. It was addressed within support.
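A recurring root cause across these rollup tickets (and the point of the 4.7 change noted earlier in this log) is that the copy, index, and drop steps ran separately, so a failure partway left both the stats0 and stats1 tables populated. A sketch of the all-or-nothing version, using sqlite3 as a stand-in for Ingres (table and column names are illustrative):

```python
import sqlite3

def rollup(conn, src, dst):
    """Copy src into dst, enforce the unique key, and drop src as a
    single transaction. If indexing hits duplicate keys, the whole
    step rolls back and src survives for the next attempt."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(f"INSERT INTO {dst} SELECT * FROM {src}")
            conn.execute(f"CREATE UNIQUE INDEX {dst}_ix "
                         f"ON {dst} (sample_time, element_id)")
            conn.execute(f"DROP TABLE {src}")
        return True
    except sqlite3.DatabaseError:
        return False
```

On duplicate keys the insert is rolled back along with the index and the drop, which is the behavior the 4.7 note describes: either the hour is fully rolled up, or the database is left exactly as it was.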
5/5/2000 9:05:36 PM SysAdmin Failed rollups: neth/bin/sys/nhiRollupDb
Begin processing (04/17/2000 08:54:15 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Apr 17 23:04:13 2000)
Ingres errlog:
Mon Apr 17 02:16:29 2000 E_SC0216_QEF_ERROR Error returned by QEF.
Mon Apr 17 02:16:29 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
Mon Apr 17 02:16:29 2000 E_SC0216_QEF_ERROR Error returned by QEF.
Mon Apr 17 02:16:29 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
Tue Apr 18 06:39:16 2000 E_CL1002_LK_TIMEOUT Lock timed out
Tue Apr 18 06:39:16 2000 E_DM9043_LOCK_TIMEOUT Timeout occurred during CONTROL lock request on table neth.nh_stats0_956051999 in database nethealth with mode 3. Resource held by session [27856 33618].
Tue Apr 18 06:39:16 2000 E_RD002B_LOCK_TIMER_EXPIRED Timeout has occurred on a system catalog
Tue Apr 18 06:39:16 2000 E_DM004D_LOCK_TIMER_EXPIRED Lock timer expired before lock granted.
Tue Apr 18 06:39:16 2000 E_PS0010_TIMEOUT Timeout has occured on a system catalog.
Tue Apr 18 06:39:32 2000 E_CL1002_LK_TIMEOUT Lock timed out
Tue Apr 18 06:39:32 2000 E_DM9043_LOCK_TIMEOUT Timeout occurred during CONTROL lock request on table neth.nh_stats0_956051999 in database nethealth with mode 3. Resource held by session [27856 33618].
Tue Apr 18 06:39:32 2000 E_RD002B_LOCK_TIMER_EXPIRED Timeout has occurred on a system catalog
Tue Apr 18 06:39:32 2000 E_DM004D_LOCK_TIMER_EXPIRED Lock timer expired before lock granted.
Tue Apr 18 06:39:32 2000 E_PS0010_TIMEOUT Timeout has occured on a system catalog.
Line from nhiIndexDiag output: Problem with Index . Error: Index 'nh_stats0_955007999_ix2' was not not in the database. Duplicate problem: Found 11 duplicates out of 86220 rows for index nh_stats0_955007999_ix2 on table nh_stats0_955007999. Saving duplicates in '/neth/tmp/session.nh_stats0_955007999.dat'. System log, related files are in Escalated tickets directory/3300/33912 Customer running NH version 4.5.1 with P12 Support has the worksheet for this one. They sent instructions to the customer, and are waiting to hear back. Estimating 1hA to review any logs or data necessary. Closing as support has closed call ticket 5/5/2000 9:05:36 PM SysAdmin Customer was getting messages in his System Messages window that stated "The database does not contain enough space for more 'network element' data, dropping this poll." Conversations_Rollup.100001.log shows error: Error: Unable to execute 'MODIFY nh_dlg1s_946101599 TO BTREE UNIQUE ON sample_time, dlg_src_id, nap_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Wed Apr 19 13:05:47 2000) The following files can be found in the \\Voyagerii\Escalated Tickets\34000\34024\ directory: 4192000.txt (System messages) bdf.log errlog.log Conversations_Rollup.100001.log I believe this was caused by the append/indexing not being in the same transaction. This problem was fixed in a recent patch. Sent mail to Support asking for some data, but I believe this is a simple problem to fix once I get the data. I am closing this as the associated call ticket is closed. 5/5/2000 9:05:36 PM SysAdmin Statistics rollups failing with the following error: Begin processing (04/18/2000 19:00:45). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. 
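The "duplicate keys" failures throughout these tickets are what nhiIndexDiag reports; at its core the check is just finding repeated key tuples that block a unique index. A sketch of that idea on a plain-text dump of key columns (the dump format here is an assumption for illustration, not the tool's actual input):

```shell
# Print key rows that occur more than once (candidates blocking a unique index).
# Assumes one whitespace-separated key tuple per line, e.g. "sample_time elem_id".
find_dup_keys() {
    sort | uniq -d
}

# Usage (hypothetical file): cut -d' ' -f1,2 table_keys.txt | find_dup_keys
```

Whatever this prints corresponds to the rows nhiIndexDiag saves into the session.*.dat files mentioned above.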
(Tue Apr 18 19:17:50 2000) ----- Customer is running 4.5.1 P11 D7 on Solaris 2.6 Sys messages, errlog.log and output of sqlHelp.sh are in voyagerii\tickets\34000\34035 Looked at the data available. Not enough to determine the actual cause (i.e., server crash, etc.) Recommended Shane go ahead and run the cleanStats script. Waiting to hear back. Got OK from Support to close this one out. 5/5/2000 9:05:36 PM SysAdmin Customer received an error that "rows could not be indexed because of duplicate keys" and stats roll-ups failed. Error message: 04/18/2000 00:16:02 DbsOcDbJobStepFailed Error: Job step 'Statistics Rollup' failed (the error output was written to /opt/nethealth/log/Statistics_Rollup.100000.log Job id: 100000) Excerpt from ingres error log: TRAFFICD::[32837 , 000005cf]: Fri Mar 31 02:22:04 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log TRAFFICD::[32837 , 000005cf]: Fri Mar 31 02:22:04 2000 E_SC0216_QEF_ERROR Error returned by QEF. TRAFFICD::[32837 , 000005cf]: Fri Mar 31 02:22:04 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG Have received from customer: A listing of all the tables in the database Output from the nhiIndexDiag command ingres error log system messages This is availability backfill. Sent mail to Tony. Waiting to hear back. Closing ticket as call ticket was closed 3 days ago. 5/5/2000 9:05:37 PM SysAdmin Would like the capability to load a saved database into an existing (running) db. Example as follows: INGRES database is in an inconsistent state. - re-install INGRES and load the last backup of the database that had been taken before the crash. 
- However the decisive last step in the recovery operation, that of actually doing the load, was overlooked and the poller was started and began polling normally. - Some days later, after running the first weekly Health Report, they first noticed the old data had not been loaded. However, they already had 3 days of newly polled data. - They had to destroy the polled data and reload the backup db. In Network Health there is no possibility to load an existing database backup into the current running database without destroying the current data. 9/1/2001 4:19:18 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year. If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request. 5/5/2000 9:05:38 PM SysAdmin Conversations Rollup failed: $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME Error: Append to table nh_dlg1s_955695599 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.). - Data Analysis, Stats Rollup, Stats Index all fine. - System Messages shows that only the Conversations Rollup is failing. - Ingres errlog: no errors. indexDiag.out output file: (one example of many) Problem with Index . Index 'nh_dlg0_956645999' had different keys than expected. Duplicates are acceptable in -- ignore any messages See Escalated Tickets directory 34000\34128 for all related files, including: log files, system messages, output of nhiIndexDiag, df -k and tables.out Customer running NH version 4.5.1 with patch 11 on Solaris 2.7 manthony 4/26/00 Looked at table list and found that there are a number of TA rollup tables as well as TA raw tables missing that should be there?! Asked customer support to find out what's going on. Awaiting feedback. 
manthony 5/1/00 Asked customer support for the following: "select min(sample_time), max(sample_time) from nh_dlg1s_955695599\g" and the same query for nh_dlg1b_955695599. This will tell us if we can drop the dlg0 table or if we have to drop the dlg1s table and try rollups again. 5/8/2000 5:04:24 PM manthony Asked customer support to drop nh_dlg1s_955695599 and nh_dlg1b_955695599 and run a nhiDialogRollupFix executable that robin has going into patch 13. Awaiting feedback. 5/12/2000 2:15:27 PM manthony Customer reports all is well. Closing issue. 5/5/2000 9:05:38 PM SysAdmin Dialog rollups fail with the following error: Job started by Scheduler at '4/18/2000 03:05:51 PM'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (4/18/2000 03:05:51 PM). Error: Append to table nh_dlg1b_950936399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 153 rows not copied because duplicate key detected. ). ----- Scheduled Job ended at '4/18/2000 03:05:56 PM'. ----- All files requested by Eng. have been placed on voyager. Requested the customer's db, awaiting feedback. Sent a new nhiDialogRollup -- need to hear back from customer. It looks like they are still having problems. I've requested the database. Once I get it, it will take me several hours to determine what is going on. New estimate: 1d 5/12/2000 9:26:32 AM rtrei Shane sent another request to customer yesterday. Still waiting to hear back. 5/17/2000 9:46:02 AM rtrei Still waiting to hear from the customer. 6/2/2000 12:23:22 PM rtrei Still waiting to hear from customer 7/13/2000 6:02:04 PM rtrei Closed as call ticket closed-- customer never responded. 5/5/2000 9:05:38 PM SysAdmin Received the error that the DB was inconsistent when trying to stop the server. Although when they ran the nhForceDb command, it responded with a message indicating that the DB was NOT inconsistent. 
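The min/max sample_time queries a few entries above decide whether a raw (dlg0) table's time range is already covered by the rolled-up (dlg1) tables, in which case the raw table can be dropped without losing data. The decision itself can be sketched as a tiny helper; the helper is illustrative only, the epoch values would come from the sql queries shown in the ticket.

```shell
# covers RAW_MIN RAW_MAX ROLLED_MIN ROLLED_MAX
# Prints "yes" if the rolled-up range fully covers the raw range (safe to
# drop the raw table), "no" otherwise. Times are epoch seconds, as in the
# table-name suffixes (e.g. nh_dlg1s_955695599).
covers() {
    if [ "$3" -le "$1" ] && [ "$4" -ge "$2" ]; then
        echo yes
    else
        echo no
    fi
}

# Usage: covers "$raw_min" "$raw_max" "$rolled_min" "$rolled_max"
```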
So, they tried a nhDestroyDb and it indicated that the DB was inconsistent. Requested the following: (in the tickets directory of VoyagerII) infodb nethealth - logdump - verifydb -odbms_catalogs -mreport - log files from ingres/files directory - system messages. Excerpt from ingres error log: DM0100_DB_INCONSISTENT Database is inconsistent. ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 E_DM0152_DB_INCONSIST_DETAIL Database nethealth is inconsistent. ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 E_SC0121_DB_OPEN Error opening database. Name: nethealth Owner: netadmin Access Mode: 00000002 Flags 00000000 ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: /sbapps/sbapps/nethealth/idb/ingres/data/default/nethealth Flags: 00000003 ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 E_DM0100_DB_INCONSISTENT Database is inconsistent. ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 E_US0026 Database is inconsistent. please contact the ingres system manager ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 E_SC0123_SESSION_INITIATE Error initiating session. ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 W_DM5422_IIDBDB_NOT_JOURNALED WARNING: The iidbdb is being opened but journaling is not enabled; this is not recommended. ::[3697 , 404F8120]: Tue Apr 25 11:12:14 2000 W_DM5422_IIDBDB_NOT_JOURNALED WARNING: The iidbdb is being opened but journaling is not enabled; this is not recommended. ::[II_RCP , 4013FD80]: Tue Apr 25 14:13:08 2000 E_DMA469_PROCESS_HAS_DIED Process (000016F1) has died. A process attached to the INGRES logging and locking system has exited without going through normal cleanup processing. The system will now perform cleanup processing on behalf of the failed process. manthony 4/26/00 Customer has forced and saved the DB. Customer is now loading the DB. Awaiting feedback. manthony 4/27/00 Customer reported all is well. Closing issue. 
5/5/2000 9:05:38 PM SysAdmin Attached are some errors from a DB load from a 4.5 save to a 4.6 new install (all patches). Let me know if I should worry about anything, it's polling OK as far as I can tell. Andy Gerber Loading table nh_var_units . . . Loading table nh_hourly_health . . . Loading the sample data . . . Error: Uncompress of file /dbsave/MWF.tdb/nh_stats0_956645999 failed. Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 0 max_range: 956642399. Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 0 max_range: 956645999. Updating a prior version 4.5 database . . . Begin processing Copying new fields into the Database (04/25/2000 02:54:55 PM). End processing Copying new fields into the Database (04/25/2000 02:56:03 PM). Creating the Table Structures and Indices . . . Creating the Table Structures and Indices for sample tables . . . Granting the Privileges . . . Granting the Privileges on the sample tables . . . Load of database 'nethealth' for user 'neth' completed successfully. End processing (04/25/2000 05:13:56 PM). The following files can be found in the \\Voyagerii\Escalated Tickets\34000\34272\ directory: load_log.txt std_out.txt (Command line output) manthony 4/26/00 It looks like 2 hours worth of raw stats data did not get loaded. The data is from 4/25/00 1:00 AM - 3:00 AM EST. Both files did not get loaded due to a bug in 4.5.1 save that was fixed in patch 11. Other than the 2 hours of data everything looks OK. Awaiting customer feedback. manthony 4/26/00 Call ticket was closed. 5/5/2000 9:05:39 PM SysAdmin Customer is running NH vers 4.6, no patches applied. When trying to start NH they received an error message stating the DB was inconsistent. Exact error message: nethealth requires an existing, accessible database. However, INGRES returned the following error when an attempt was made to access 'nethealth': E_US0026 Database is inconsistent. 
please contact the system manager (Wed Apr 26 07:54:27 2000) Error message from errlog.log ALBATROS::[32796 , 00000018]: Wed Apr 26 08:25:55 2000 E_SC0123_SESSION_INITIATE Error initiating session. ALBATROS::[32796 , 0000001d]: Wed Apr 26 08:48:07 2000 E_DM0100_DB_INCONSISTENT Database is inconsistent. ALBATROS::[32796 , 0000001d]: Wed Apr 26 08:48:07 2000 E_DM0152_DB_INCONSIST_DETAIL Database nethealth is inconsistent. ALBATROS::[32796 , 0000001d]: Wed Apr 26 08:48:07 2000 E_SC0121_DB_OPEN Error opening database. Name: nethealth Owner: cnh Access Mode: 00000002 Flags 00000000 ALBATROS::[32796 , 0000001d]: Wed Apr 26 08:48:07 2000 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: /disk2/NetHealth/idb/ingres/data/default/nethealth Flags: 00000003 Information returned from infodb shows: The Database is Inconsistent. Cause of Inconsistency: REDO_ERROR Bugged due to policy of bugging and escalating all DB issues. manthony 4/28/00 Customer support has instructed customer on how to fix issue. Awaiting customer feedback. manthony 5/2/00 Customer reported all is well. Closing this issue. 5/5/2000 9:05:41 PM SysAdmin Customer running NH 4.5.1 p11 D08. Customer has scheduled nhFetchDb job, received email notification that this failed. Error returned: SQL Error: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Fri Apr 28 13:10:22 2000) Requested copy of ingres error log, but the log revealed no error messages. However, IndexDiag output did reveal errors: Table is lacking an index. Duplicate problem: Found 0 duplicates out of 22 rows for index job_schedule_ix on table nh_job_schedule. Problem with Index . Error: Index 'nh_stats0_956930399_ix1' was not not in the database. Duplicate problem: Found 10 duplicates out of 44609 rows for index nh_stats0_956930399_ix1 on table nh_stats0_956930399. Saving duplicates in '/opt/nethealth/tmp/session.nh_stats0_956930399.dat'. Problem with Index . 
Error: Index 'nh_stats0_956930399_ix2' was not not in the database. Duplicate problem: Found 10 duplicates out of 44609 rows for index nh_stats0_956930399_ix2 on table nh_stats0_956930399. Saving duplicates in '/opt/nethealth/tmp/session.nh_stats0_956930399.dat'. Analysis of indexes on database 'nethealth' for user 'nhadmin' completed successfully. Issue bugged and escalated due to DB escalation policy Have them send the output files containing the duplicates so we can determine where they came from. Support knows how to clean them up, and the Fetch should work. I am hoping these are availability backfill and not duplicates from the remote site - which we have code to prevent - or so we think. Customer is fixed and running - original problem with scripts sent by support. No product code to fix. 5/5/2000 9:05:41 PM SysAdmin Customer is running NH vers 4.5.1 patch level 7 Cert 2 Noticed in his database status output that his Statistics Rollups had been failing for a while Statistics_Rollup.log lists this failure due to duplicate keys Begin processing (04/27/2000 08:00:08 PM). Error: Append to table nh_stats2_944369999 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 82 rows not copied because duplicate key detected. Requested ingres error log, list of tables, nhiIndexDiag output, and system messages. Errors listed in ingres error log KIRK ::[32786 , 00000018]: Fri Feb 4 15:43:08 2000 E_US0014 Database not available at this time. o The database may be marked inoperable. This can occur if CREATEDB failed. o An exclusive database lock may be held by another session. o The database may be open by an exclusive (/SOLE) DBMS server. KIRK ::[32786 , 000034f5]: Thu Apr 27 07:58:19 2000 E_SC0216_QEF_ERROR Error returned by QEF. KIRK ::[32786 , 000034f5]: Thu Apr 27 07:58:19 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. 
Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log Bug and escalate due to DB issues. This is a standard duplicate on a pre-patch 10 database. I don't anticipate doing any work unless support runs into something unexpected. 1hA Asked Steph to get the problem stats1 table so I can look at the dups. Got call from Steph: call ticket is closed, I should close this one. 5/5/2000 9:05:42 PM SysAdmin Statistics Rollup and Statistics Indexing are failing and customer is getting DMT SHOW errors. Statistics_Index.100005.log ----- Job started by Scheduler at '2/5/2000 14:20:18'. ----- ----- $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (2/5/2000 14:20:18). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue May 02 13:20:31 2000) ). ----- Scheduled Job ended at '2/5/2000 14:20:31'. ----- ________________________________________ Statistics_Rollup.100000.log ----- Job started by Scheduler at '27/4/2000 19:30:47'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (27/4/2000 19:30:51). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Apr 27 18:32:34 2000) ). ----- Scheduled Job ended at '27/4/2000 19:32:36'. ________________________________________ Non-recoverable DMT SHOW errors from errlog.log: 00000172 Wed Apr 26 21:45:37 2000 E_DM9004_BAD_FILE_OPEN Disk file open error on database:nethealth table:nh_stats0_953013599 pathname:E:\nethealth\oping\ingres\data\default\nethealth filename:aaaalckp.t00 open() failed with operating system error 32 (The process cannot access the file because it is being used by another process.) 
00000172 Wed Apr 26 21:45:37 2000 E_DM923F_DM2F_OPEN_ERROR Error occurred opening a file for a table. 00000172 Wed Apr 26 21:45:37 2000 E_DM9336_DM2F_BUILD_ERROR Error building a File Control Block. 00000172 Wed Apr 26 21:45:37 2000 E_DM9C5B_DM2T_OPEN_TABIO Error occurred opening a Table Control I/O Block. 00000172 Wed Apr 26 21:45:37 2000 E_DM9C8B_DM2T_TBL_INFO An error occurred while attempting to build the Table Control Block for table (45743,0) in database nethealth. 00000172 Wed Apr 26 21:45:37 2000 E_DM9C89_DM2T_BUILD_TCB An error occurred while building a Table Control Block for a table. 00000172 Wed Apr 26 21:45:37 2000 E_DM9C8A_DM2T_FIX_TCB An error occurred while trying to locate and/or build the Table Control Block for a table. 00000172 Wed Apr 26 21:45:37 2000 E_RD0060_DMT_SHOW Cannot access table information due to a non-recoverable DMT_SHOW error 00000172 Wed Apr 26 21:45:37 2000 E_DM010B_ERROR_SHOWING_TABLE An error occurred while showing information about a table. TVSN0088::[II\INGRES\184 , 00000172]: Wed Apr 26 21:45:37 2000 E_RD0060_DMT_SHOW Cannot access table information due to a non-recoverable DMT_SHOW error Ingres log files, system log files are in: //voyagerii/tickets/34000/34452/ directory. Further files have been requested: the output of infodb, logdump, verifydb -odbms_catalogs -mreport -sdbname nethealth, and verifyTables. *********** Mail to support: Sheldon-- This actually looks like a standard duplicate problem. I'm not sure that we need to worry about the DMT show problem-- when I look in the errlog.log, it was just caused by 2 systems trying to access the table info at the same time-- it wasn't a problem that the file wasn't found (like the other one you are working on). So, my recommendation is to concentrate on the duplicates problem and let the DMT show work go unless we see something new. 
I suspect that once we get the data, this will turn into a standard availability backfill problem and you can just run the cleanStats script. It has all the parameters. (It is patch 3 that closes the last of the holes, so they can still occur with patch 2.). Ask them if they shut down their NT system properly-- the errlog.log indicates a large number of powerdowns without shutdowns which is asking for problems with any database system. Other than that, I am concerned that the discover process seemed to hang a couple of times. It needs to go to Dave Shepard, but we might want to run a command next time it is hung to see if it is waiting on the database. The utility is called ipm and it would be available in the ingres tools program group. Let me know if you want to run this and I will give better instructions. 5/17/2000 9:46:54 AM rtrei Waiting to hear from customer on results. 7/13/2000 6:03:36 PM rtrei closing as call ticket is closed. 5/5/2000 9:05:42 PM SysAdmin Failed Statistics Rollup: Begin processing (5/1/2000 08:00:26 PM). Error: Append to table nh_stats1_949726799 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.). IndexDiag output: - Problem with Index nh_stats1_945838799_ix1. Error: Index 'nh_stats1_945838799_ix1' was not not in the database. Duplicate problem: Found 1449 duplicates out of 2898 rows for index nh_stats1_945838799_ix1 on table nh_stats1_945838799. Saving duplicates in 'H:/nethealth/tmp/session.nh_stats1_945838799.dat'. - Problem with Index nh_stats1_945838799_ix2. Error: Index 'nh_stats1_945838799_ix2' was not not in the database. Duplicate problem: Found 1449 duplicates out of 2898 rows for index nh_stats1_945838799_ix2 on table nh_stats1_945838799. Saving duplicates in 'H:/nethealth/tmp/session.nh_stats1_945838799.dat'. 
Ingres Errlog.log: Mon May 01 13:41:02 2000 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association Tue May 02 07:51:12 2000 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 6 for table nh_stats0_956933999_ix1 in database nethealth with mode 5. Resource held by session Tue May 02 07:51:12 2000 E_DM0042_DEADLOCK Resource deadlock. Tue May 02 07:51:12 2000 E_QE002A_DEADLOCK Deadlock detected. Tue May 02 07:52:19 2000 E_SC0216_QEF_ERROR Error returned by QEF. Tue May 02 07:52:19 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Stats index is fine. System messages and list of all tables are coming. Related files in Escalated tickets\34000\34457 Customer running NH version 4.6h (no patches) on NT 4.0 Looks like 2 problems: a completely duplicate stats1 table (which is older) which is probably what the load complained about, and a standard stats1 append problem. Support needs to get a complete listing of the tables to decide how to proceed. 2hA 5/9/2000 10:36:56 AM rtrei Awaiting notice from support that all is ok 5/12/2000 9:29:55 AM rtrei closing as associated call ticket is closed. 5/5/2000 9:05:42 PM SysAdmin Rollups are failing due to duplicate table nh_stats2_908665199 The following files are in VoyagerII esc. tickets 34000/34032 errlog.log stats rollup log everything in $NH_HOME/tmp Statistics_Index.100005.log Analysis.txt The output of echo "help\g" | sql nethealth > tables.out Software patch cert patch Reviewed data available. It is some kind of database corruption problem. Since it has been going on for 3 weeks, my recommendation to Sheldon was to destroy, create, and reload (do a save beforehand if needed). We won't spend much time pursuing as we are expecting a new patch, and the customer needs immediate recovery. However, it is interesting to note that there was no indication of a shutdown (bad or otherwise) immediately prior to the problem occurring. 
Waiting to hear back from support on how the problem proceeds. Logged info with CA to see if they have any suggestions. ********** Recommended to Sheldon they do a destroy, create, reload-- database is too corrupted for us to try and fix and hope nothing goes wrong. Do not expect to give any additional input unless requested from support. 1hA 5/10/2000 11:03:20 AM rtrei This does not require a code change 5/17/2000 9:14:45 AM rtrei Still waiting to hear back from customer. 5/17/2000 10:07:48 AM rtrei changing status to moreinfo until we hear back from customer. 5/18/2000 6:33:32 PM smcafee Moved Low priority issues to postponed. Needs review for readme. 5/19/2000 8:20:03 AM smcafee Not a 4.7 issue. Should not have been postponed. 5/25/2000 9:35:56 AM rtrei resetting status to moreinfo. 5/30/2000 9:44:01 AM rtrei Still awaiting customer feedback regarding whether the new hard drive has been installed. 6/7/2000 11:36:13 AM rtrei This is being closed as the call ticket is closed. 5/5/2000 9:05:43 PM SysAdmin Statistics_Index fails with following: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. IndexDiag.out: (note - have same message for ix2). Problem with Index nh_stats0_956267999_ix1 Error: Index 'nh_stats0_956267999_ix1' was not not in the database. Duplicate problem: Found 65 duplicates out of 77534 rows for index nh_stats0_956267999_ix1 on table nh_stats0_956267999. Saving duplicates in '/opt/isv/neth/tmp/session.nh_stats0_956267999.dat'. Ingres Errlog.log: Fri Apr 28 08:04:16 2000 E_SC0216_QEF_ERROR Error returned by QEF. Fri Apr 28 08:04:16 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log. Note that Statistics Rollups and Data Analysis are fine. No failed jobs in system messages. 
All related files in Escalated Tickets\34000\34398 Customer running NH version 4.6 with P02 and D02 on HP 11.0 Looks like availability backfill. Had support run the cleanStats script. Unless something else crops up, I don't expect any further involvement on this one. 0hA 5/10/2000 11:04:39 AM rtrei this does not require code change 5/10/2000 11:32:55 AM rtrei Customer's problem is resolved. 5/5/2000 9:05:43 PM SysAdmin Failed stats rollups: Begin processing (04/26/2000 07:00:49 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Failed stats index: Begin processing (05/03/2000 12:21:00 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. indexDiag.out: Problem with Index . Error: Index 'nh_stats0_954705599_ix1' was not not in the database. Duplicate problem: Found 51 duplicates out of 10846 rows for index nh_stats0_954705599_ix1 on table nh_stats0_954705599. Saving duplicates in '/opt/nethealth/tmp/session.nh_stats0_954705599.dat'. Ingres errlog.log: Sat Apr 15 03:12:52 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. System messages shows Stats Rollup, Data Analysis, and two Health jobs failing (1000396 and 1000421). Note that the Data Analysis error is limited to these two failing Health jobs being improperly defined. Output of df -k shows no problems. All related files in Escalated Tickets\34000\34483 Customer running NH version 4.6 with P02 on Solaris 2.6 Standard AB dups. Tony has already run cleanStats. Awaiting feedback from the customer. 0hA ready to close. 5/5/2000 9:05:43 PM SysAdmin In the stats rollup, receiving: Error: Append to table nh_stats1_949816799 failed, see the ingres errlog for more information. E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. 
Ran the tania.sh script on voyagerii, which did:
#!/bin/sh
#. ./nethealthrc.sh
echo "help\g" | sql nethealth > help.nt.txt
echo "Select min(sample_time) from nh_stats1_949816799\g" | sql nethealth > minTime.out
echo "Select max(sample_time) from nh_stats1_949816799\g" | sql nethealth > maxTime.out
Got mintime.out, maxtime.out, help.nt.txt, but we had used the wrong number so we had to re-run it. Re-ran it and the email server was down, so had to have the customer do the work. He inserted the output of min and output of max into the help.nt.txt He counted 22 tables. NOT 24. I am having him email these (mintime.out, maxtime.out, help.nt.txt) when email is back up. Will place on voyagerii. ************** Reviewed work with Jose. This happened a while ago, and matches the profile of a standard stats1 rollup failure dup. Jose is deleting the stats0 tables that have already been rolled up. Do not expect they will need any more input from me. Will wait to hear back from Jose. 1hA ************* Had customer delete stats0 tables, and then run nhiRollupDb; customer will call when either finished or failed. 5/10/2000 11:02:10 AM rtrei this does not require a code change 5/12/2000 9:31:47 AM rtrei Rollups are ok. Customer has problems with reports. 5/16/2000 1:22:41 PM jpoblete Customer called me; after deleting a report that was causing problems, all other scheduled jobs ran OK. 5/17/2000 9:12:59 AM rtrei closing, call ticket closed. 5/5/2000 9:05:44 PM SysAdmin Stats rollup failing due to following error: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Customer is running 4.6 P2 D2 ingres log files, stats rollup log, system log, output of sqlHelp are in voyagerii\34000\34602 5/10/2000 11:04:58 AM rtrei This does not require code change 5/12/2000 9:33:31 AM rtrei under control. Call ticket remaining open for a few days to check continued status w customer 5/17/2000 9:15:22 AM rtrei closing as call ticket is closed. 
5/9/2000 10:58:31 AM drecchion Rollups failing Append to table nh_stats1_953873999 failed The following is in the esc. tickets directory on voyagerII 34000/34745 errlog.log stats rollup log stats index log The transaction log size is 100 MB The total space on the drive is 1.0 GB The patch level and rev is 4.5.1 Patch 10 5/9/2000 11:19:02 AM manthony Asked support to get a help\g and follow worksheets provided by engineering. 5/10/2000 10:40:30 AM manthony Customer reports all is well, closing issue. 5/9/2000 11:13:06 AM drecchion Server down - rollups failing on nh_stats1_953791199; customer hit the wall at 96% usage The following documentation is on voyager II on 34000/34728 errlog.log stats roll up df -k datanalysis.log software version is 4.1 Patch level is u 5/9/2000 2:48:26 PM manthony Discovered that customer deleted their TX log file. I had them recreate it and the DB came up. Now having them save/destroy/create and load the DB. When that is completed I will look into the rollup problems they are having. Awaiting help\g output and rollups output. 5/10/2000 10:45:11 AM manthony Customer loaded DB, but encountered dups. Sent customer a script to clean them up. Customer should then run rollups. Awaiting fix feedback. 5/12/2000 11:41:12 AM manthony Customer is up and running and rollups are working again. Closing issue. 5/11/2000 12:38:59 PM dpatel Statistics rollups failing with following error: Begin processing (05/07/2000 09:05:05). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Sun May 7 09:10:50 2000) ). No entries in errlog.log since April 24, 00 The following files are in voyagerii\34000\34750 Stats rollup log and index log... Customer is not patched. 
5/15/2000 1:57:24 PM dpatel sqlHelp output is on voyagerii\tickets\34000\34750 5/15/2000 4:15:09 PM manthony Asked support to: (1) run cleanStats -clean (2) drop table nh_stats1_956807999 (3) run rollups. This should fix the problem. Awaiting fix feedback. 5/17/2000 12:31:22 PM jpoblete customer did what Mike Anthony asked and now his rollups are OK. 5/17/2000 1:23:35 PM manthony Customer reported all is well. Closing this issue. 5/11/2000 12:41:30 PM dpatel Statistics rollups failing with following error: Begin processing (2000/05/09 20:00:28). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue May 9 23:01:01 2000) ). All relevant files are in voyagerii\34000\34735 All ingres log files, output of sqlHelp, output from nhIndexDiag, stats rollup log, save.log, DA log... Customer is not patched. 5/17/2000 3:08:22 PM manthony Customer reported all is well. Closing issue. 5/11/2000 7:29:29 PM jnormandin Customer is running NH 4.5.1 p12 on Solaris 2.6 Excerpt from stats rollup log $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (05/10/2000 02:55:06 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed May 10 18:02:57 2000) ). Excerpt from errlog.log 0000007a]: Thu May 11 17:55:37 2000 E_DM9713_SR_WRITE_ERROR Error occurred writing to sort work file -- path = /opt/nethealth/idb/ingres/work/default/nethealth, block 0. write() failed with operating system error 28 (No space left on device) INO-CON-::[34470 , 0000007a]: Thu May 11 17:55:37 2000 E_DM9703_SR_WRITE_ERROR Error occurred writing a sort work file. Check disk space and quota on the sorting device(s). write() failed with operating system error 28 (No space left on device) INO-CON-::[34470 , 0000007a]: Thu May 11 17:55:37 2000 E_DM0112_RESOURCE_QUOTA_EXCEED Error allocating resource; resource limit exceeded. 
INO-CON-::[34470 , 0000007a]: Thu May 11 17:55:37 2000 E_QE0052_RESOURCE_QUOTA_EXCEEDED Out of disk space, disk quota, or open file quota. INO-CON-::[34470 , 0000007b]: Thu May 11 17:55:49 2000 E_DM0112_RESOURCE_QUOTA_EXCEED Error allocating resource; resource limit exceeded. INO-CON-::[34470 , 00000023]: Thu May 11 17:59:03 2000
Output from indexDiag: Table is lacking an index. Duplicate problem: Found 0 duplicates out of 15 rows for index job_schedule_ix on table nh_job_schedule. Problem with Index. Error: Index 'nh_stats1_956991599_ix1' was not not in the database. Duplicate problem: Found 0 duplicates out of 0 rows for index nh_stats1_956991599_ix1 on table nh_stats1_956991599. Problem with Index. Error: Index 'nh_stats1_956991599_ix2' was not not in the database. Duplicate problem: Found 0 duplicates out of 0 rows for index nh_stats1_956991599_ix2 on table nh_stats1_956991599.
Had him run the cleanStats script, received the following: dedupinfo.out nh_stats1_956991599, dupinfo.out nh_stats1_956991599. After the script was run, had him run nhIndexDiag again. Output was identical to the run above (same missing-index errors, same duplicate counts). The script was unsuccessful in removing the dups.
5/12/2000 9:58:02 AM manthony Asked support to make sure there is enough disk space on the system. The customer has run out of space in II_WORK. To get rollups working: (1) drop table nh_stats1_956991599 (2) run rollups. Awaiting fix feedback.
5/12/2000 4:56:40 PM manthony Customer reported all is well. Closing issue.
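The recurring E_US1592 failure happens because a unique index cannot be built over rows whose key columns repeat. This hypothetical Python sketch (not the actual cleanStats script, whose contents these tickets never show) illustrates the idea of a cleanup pass: keep one row per key so the index can be rebuilt. The column layout is an assumption.

```python
# Hypothetical stand-in rows: (element_id, sample_time, value).
rows = [
    (101, 953873999, 42),
    (101, 953873999, 42),   # duplicate key -> would break a unique index
    (102, 953873999, 17),
]

def dedup(rows):
    """Keep the first row seen for each (element_id, sample_time) key."""
    seen, clean = set(), []
    for r in rows:
        key = r[:2]
        if key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

clean = dedup(rows)
print(len(clean))  # 2 rows survive; a unique index can now be built
```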
5/12/2000 5:38:25 PM drecchion Roll ups failing on unknown table. Customer to send roll up log to Jose. All related documentation is on voyager II 34000/34887: indexdiag.out, tables1.out, staqtsindex 1000005.log, speckeysd.lock, sdtvolcheck15915.dat, ps_data.dat, errlog.log, df -k.info.log, daemonstat.any.80, tables1.wri. Customer is on patch 2.
5/15/2000 9:40:37 AM rtrei Jose-- It looks like you should drop this table: ==========949381234======================================================== nh_stats0_949467599 neth table ==========949467599======================================================== You don't have to worry about losing data as it seems to already be there. Let me know if I've missed something, but from the data in the escalated tickets directory, this seems straightforward. 1hA
5/18/2000 6:33:35 PM smcafee Moved low priority issues to postponed. Needs review for readme.
5/19/2000 8:22:01 AM smcafee Not a 4.7 issue. Should not have been postponed.
5/25/2000 9:44:55 AM rtrei Closing as call ticket closed.
5/17/2000 7:28:21 AM foconnor Customer would like to request an enhancement for the nhExportConfig command. As of now, you cannot (as far as I know from my testing) export configurations with the parameters -subjName and -elemType at the same time. Or rather, the result is not what you expect. For example, I would like to be able to list all WAN elements from a specific group. Right now, you can EITHER list the config of (all) the elements in a group/groupList OR you can list the config of all elements (in the entire database) that are WAN elements. When this command is run: nhExportConfig -dciOut out.dci -subjName groupName -subjType group -elemType anyWan, the result is all the elements in that group, and the -elemType is apparently ignored. Customer is Michael Nillson of Cygate.
9/1/2001 4:19:20 AM AR_ESCALATOR Administrative change. This ticket has been closed because it has been open for more than one year.
If this is still a requirement, it will need to be justified and re-submitted as an Enhancement Request.
5/18/2000 10:16:14 AM drecchion Roll ups are failing due to table nh_stats1_958017599. The following files are in the escalated tickets directory on Voyager II: Stats_roll up.1000005.log, Data_analysis.1000002.log, system log, tables.out, indexdiad.out, statistics_index.100005.log. Customer is currently on an unpatched 4.6 install.
5/18/2000 1:27:31 PM manthony Requested the following query: select min(sample_time), max(sample_time) from nh_stats1_958017599\g Awaiting output.
5/23/2000 1:15:45 PM manthony Sent customer support a script to fix the issue. Awaiting fix feedback.
6/5/2000 12:51:43 PM manthony Customer reported all is well. Closing this issue.
5/18/2000 3:06:45 PM tcordes Per procedure, gathered: a list of all the tables in the database, the output from nhiIndexDiag, all the logs in $II_SYSTEM/ingres/files/*.log, the Nethealth system messages, and all the files in $NH_HOME/tmp. Customer reran nhiIndexDiag and still had duplicate rows. This output is also in the tickets directory.
5/19/2000 11:31:59 AM rtrei I reviewed the data in the Escalated tickets directory. Tara thought there might be a problem because the second nhiIndexDiag output contained error messages. However, the error message was just that a stats1 index was missing, not that there were any duplicates left. I think running nhiIndexDb -u -d to reapply the index is all that is needed.
5/25/2000 9:45:26 AM rtrei Closing as call ticket closed.
5/19/2000 4:58:18 PM jnormandin Customer is running NH vers 4.6 P2 D02. Received message in console that statistics roll-up failed: Job step 'Statistics Rollup' failed (the error output was written to /opt/nethealth/log/Statistics_Rollup). Examining the Statistics_Rollup.log revealed: Begin processing (05/16/2000 07:00:18 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
(Tue May 16 22:06:10 2000) ). The entry in the errlog.log reveals: Wed May 17 13:37:00 2000 E_SC0216_QEF_ERROR Error returned by QEF. CHEWBACC::[32815 , 000059a4]: Wed May 17 13:37:00 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log. I requested a copy of his DB tables (tables.out file). I also requested he run nhIndexDiag. The file revealed that he had 14 duplicates. According to the "Database Troubleshooting Guide", the correct procedure for under 100 dups is to run the cleanStats script. All relevant error logs and files are located on Voyagerii in the escalated tickets directory.
5/23/2000 8:55:29 AM manthony Asked customer support to: (1) drop table nh_stats1_957855599 (2) run the cleanStats script (3) run rollups. This should fix the problem. Awaiting fix feedback.
5/30/2000 9:29:44 AM manthony Customer reported all is well. Closing this issue.
5/23/2000 9:30:00 AM foconnor Scheduled conversation rollups are failing and producing core dumps. Rollups run from the command line fail with segmentation faults and core dumps. There is plenty of space on the partition where Network Health and Ingres reside. Errlog.log was showing transaction abort errors; customer has increased the transaction log from 300 to 500 megs, and conversation rollups are still failing. Conversation Rollup logs do not give any useful information. From system log: Job 'Conversations Rollup' finished (Job id: 100001, Process id: 11462). Tuesday, May 23, 2000 04:07:08 AM Job step 'Conversations Rollup' failed (the error output was written to /opt/concord/log/Conversations_Rollup.100001.log Job id: 100001). Tuesday, May 23, 2000 06:00:55 AM Starting job 'Conversations Rollup' . . . (Job id: 100001, Process id: 12708).
Tuesday, May 23, 2000 08:07:10 AM Job step 'Conversations Rollup' failed (the error output was written to /opt/concord/log/Conversations_Rollup.100001.log Job id: 100001). Tuesday, May 23, 2000 08:07:10 AM Job 'Conversations Rollup' finished (Job id: 100001, Process id: 12708). From nhDbStatus: Number of Probes: 11, Number of Nodes: 155892.
df -k:
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d2 963869 47673 858364 6% /
/dev/md/dsk/d17 1986439 691988 1234858 36% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
/dev/md/dsk/d11 1616095 7563 1560050 1% /var
swap 2894136 0 2894136 0% /var/run
swap 2895592 1456 2894136 1% /tmp
/dev/md/dsk/d20 963869 1995 904042 1% /opt
/dev/md/dsk/d14 1616095 3568 1564045 1% /export/home
/dev/md/dsk/d8 8068801 2447251 5540862 31% /opt/concord
/vol/dev/dsk/c1t6d0/nh460h_rtm 568798 568798 0 100% /cdrom/nh460h_rtm
Network Health is on /opt/concord. This is a striped filesystem, so transaction.log is also on that filesystem. Files: core file, files in $NH_HOME/tmp, log files from $II_SYSTEM/ingres/files/*.log, system log, df -k and nhDbstatus: //voyagerii/tickets/35000/35278. Also requested ram, swap and kernel parameters.
5/25/2000 9:41:25 AM rtrei Sent mail to Support yesterday (see call log). They forwarded instructions on to customer. Waiting to hear back. (Problem almost certainly due to blowing the stack size.)
6/9/2000 9:51:40 AM rtrei Instructions solved problem. Support says ok to close issue. (Current workaround for blowing the stack size on nhiDialogRollup is to unlimit the stack in the nethealthrc.csh file: "unlimit".) A remedy ticket has been logged for a better, long-term fix.
5/24/2000 5:46:21 PM wburke Client has 4 gig HD. Originally called in with a maintenance error. But stated he went from 302,147,584 Free w/33433 files to 271,561,728 Free w/33555 files. Errorlog shows rollup failure. Starting job 'Statistics Rollup' . . . (Job id: 100000, Process id: 440).
Friday, January 28, 2000 20:02:10 Job step 'Statistics Rollup' failed (the error output was written to C:/nethealth/log/Statistics_Rollup.100000.log Job id: 100000). Friday, January 28, 2000 20:02:10 Job 'Statistics Rollup' finished (Job id: 100000, Process id: 440).
5/31/2000 3:39:00 PM manthony Asked support for the following info: select count(*) from nh_stats0_947397599, and the number of elements being polled.
6/6/2000 3:04:22 PM manthony Scheduled rollup ran while running rollups from the command line. Asked customer support to re-run, and if rollups fail then we will need a list of tables. Awaiting feedback.
6/12/2000 9:51:25 AM yzhang The rollup succeeded. This ticket is closed. Yulun Zhang
5/25/2000 8:17:29 AM foconnor Customer statistic rollups failing with append to nh_stats1* failed (E_CO003F COPY: Warning: 18 rows not copied because duplicate key detected.) From the Statistic Rollup Log: The Statistics Rollup job fails with the following error message: ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (24/05/2000 17:15:56). Error: Append to table nh_stats1_951861599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 18 rows not copied because duplicate key detected. ). ----- Scheduled Job ended at '24/05/2000 17:16:40'. errlog.log is in //voyagerii/tickets/35000/35427. Due to time zone differences (South Africa), further files to collect as per Robin's database document are forthcoming at a later time. Network Health 4.5.1 P13 D10, Solaris 2.7.
5/25/2000 9:43:10 AM rtrei Support has sent instructions to customer. Awaiting feedback as to results.
5/26/2000 5:22:21 AM schapman -----Original Message----- From: Pieter Becker [mailto:pieter@snscon.co.za] Sent: Friday, May 26, 2000 3:21 AM To: O'Connor, Farrell Cc: Technical Team Subject: Re: Call ticket 35427: Statistic Rollups Failing w/follow up instructions.
Farrell, After running all the scripts, I re-ran the Statistics Rollup Job and it completed successfully. You may close the call.
5/25/2000 5:50:02 PM wburke Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.) This information was requested: current NetHealth patch level; run infodb nethealth > infodb.out; logdump > trans_log.out; verifydb -odbms_catalogs -mreprt -sdbname > name.out; all logs in $II_SYSTEM/ingres/files/*.log; NetHealth system messages (save messages through the console: database->save systemlog as).
5/30/2000 9:38:46 AM rtrei Yulun-- Reassigning to you. This should be straightforward.
6/12/2000 4:13:00 PM yzhang The rollup is working, and the ticket was closed.
5/30/2000 11:41:19 AM tbailey This bug is being logged only to document and provide reference for this error message. Customer received the following error in system messages: Wednesday, May 24, 2000 11:32:18 AM Error (nhiPoller[Dlg]) Append to table nh_dlg0_959187599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 1 rows not copied because duplicate key detected. Customer ran nhiIndexDiag with output as follows: Duplicate problem: Found 0 duplicates out of 13 rows for index job_schedule_ix on table nh_job_schedule. Duplicates are acceptable in -- ignore any messages. Support has been notified that this is a benign message by Robin Trei. Related files in Escalated Tickets\35000\35447. Customer running NH version 4.6 with P03/D02 on Sol 2.5.1.
6/2/2000 12:21:55 PM rtrei It would be good to get this fixed for the 4.7.1-- it is a simple fix, does not affect anything else, and this utility has been useful to Support. 2h to fix and test
1/16/2001 3:45:15 PM wzingher Assigning to Ha Bui, target release 5.0
6/18/2001 4:40:04 PM rtrei Reassigning to Yulun for B3. Yulun, this is a low priority, but it would be good to get this updated with latest tables, etc.
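The triage rule quoted from the "Database Troubleshooting Guide" earlier in this log is that fewer than 100 duplicates calls for the cleanStats script, while larger counts get escalated or handled by dropping tables. A sketch of that decision applied to nhiIndexDiag-style output; the parsing and the sample text mirror the "Found N duplicates out of M rows" wording shown in these tickets, and this is an illustration, not a supported tool.

```python
import re

# Hypothetical nhiIndexDiag-style output, wording copied from the tickets.
SAMPLE = """\
Duplicate problem: Found 20 duplicates out of 35488 rows for index nh_stats0_958499999_ix1 on table nh_stats0_958499999.
Duplicate problem: Found 20 duplicates out of 35488 rows for index nh_stats0_958499999_ix2 on table nh_stats0_958499999.
"""

def total_duplicates(diag_text):
    """Sum the duplicate counts reported across all indexes."""
    return sum(int(m.group(1)) for m in
               re.finditer(r"Found (\d+) duplicates", diag_text))

dups = total_duplicates(SAMPLE)
print("run cleanStats" if 0 < dups < 100 else "escalate / drop tables")
```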
6/18/2001 4:41:00 PM rtrei Actually, I think we should up the estimate to 1 day for learning, plus putting in the new tables.
10/29/2001 11:37:38 AM wzingher Moving to 5.6 as this is ingres-specific.
1/22/2002 5:37:19 PM wzingher Not supporting Ingres in 5.6, declining.
5/30/2000 2:25:56 PM dpatel Statistics rollups failing due to following error: Begin processing (05/29/2000 01:50:56 PM). Error: Append to table nh_stats1_957243599 failed, see the Ingres error log file for more information (E_CO0029 COPY: Copy terminated abnormally. 0 rows successfully copied.). Customer is unpatched. sqlHelp output, min/max sample time output, ingres logs and system messages are located in voyagerii\tickets\35000\35296
5/31/2000 4:37:05 PM yzhang Sent an email to customer support and asked them to run the attached scripts to drop some stats0 tables, then run the rollups again.
6/14/2000 9:26:30 AM yzhang Still waiting for the customer's response.
10/13/2000 2:33:19 PM yzhang Waiting for customer about the result of running the script.
12/11/2000 12:15:00 PM pkuehne Closed pending more information...Peggy Anne Kuehne 12/11/00
5/30/2000 2:57:52 PM tbailey Statistics_Rollup log: Begin processing (05/29/2000 04:08:54 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Statistics_Index log: Begin processing (05/30/2000 10:20:19 AM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Ran nhiIndexDiag and saw the following duplicates: Problem with Index. Error: Index 'nh_stats0_958499999_ix1' was not not in the database. Duplicate problem: Found 20 duplicates out of 35488 rows for index nh_stats0_958499999_ix1 on table nh_stats0_958499999. Saving duplicates in '/opt/concord/tmp/session.nh_stats0_958499999.dat'. Problem with Index. Error: Index 'nh_stats0_958499999_ix2' was not not in the database.
Duplicate problem: Found 20 duplicates out of 35488 rows for index nh_stats0_958499999_ix2 on table nh_stats0_958499999. Saving duplicates in '/opt/concord/tmp/session.nh_stats0_958499999.dat'. Ingres errlog had QEF errors: Thu May 25 12:40:04 2000 E_CLFE06_BS_WRITE_ERR Write to peer process failed; it may have exited. System communication error: Broken pipe. Sat May 27 07:04:38 2000 E_SC0216_QEF_ERROR Error returned by QEF. Sat May 27 07:04:38 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Mon May 29 00:06:08 2000 E_SC0216_QEF_ERROR Error returned by QEF. Mon May 29 00:06:08 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Tue May 30 06:02:54 2000 E_SC0216_QEF_ERROR Error returned by QEF. Tue May 30 06:02:54 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query.
Output of df -k:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s0 172991 19990 135702 13% /
/dev/dsk/c0t0d0s1 246881 184639 37554 84% /usr
/dev/dsk/c0t0d0s4 246881 168223 53970 76% /var
/dev/dsk/c0t0d0s5 246881 61239 160954 28% /export/home
/dev/dsk/c0t0d0s6 7225548 6471756 681537 91% /opt
/dev/dsk/c0t2d0s3 966382 717171 191229 79% /idblog
/dev/dsk/c0t3d0s7 1952573 939209 818107 54% /nethealthDB
swap 1765528 272 1765256 1% /tmp
System messages: Statistics_Rollup, Data_Analysis and some Health and Trend jobs are failing. Data_Analysis log - unrelated issue: Begin processing (05/30/2000 12:35:58 AM). Warning: Job 1000275 'Health' is incorrectly defined (no elements in 'ATM_switch_ports' apply to this report). Warning: Job 1000276 'Health' is incorrectly defined (no elements in 'ATM_router_ports' apply to this report). All requested files are in Escalated tickets\35000\35494. Customer running NH version 4.5.1 with P13/D08 on Solaris 2.6.
5/31/2000 5:51:50 PM yzhang Asked to run cleanStats scripts, drop a stats1 table, then run the rollups again.
6/9/2000 11:31:54 AM yzhang Rollup succeeded; the ticket is closed.
5/31/2000 11:39:52 AM tcordes Customer's conversations rollups are failing. No error indicated in rollups log; message appears in system messages. Ingres error log indicates three problems: E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_dlg1b_954565199, Page 5439. E_DM0042_DEADLOCK Resource deadlock. E_SC0216_QEF_ERROR Error returned by QEF. Log files, contents of $NH_HOME/tmp/ directory, config.dat and system messages are on Voyager.
6/2/2000 11:58:19 AM rtrei Shane-- Just reviewed the call log on this one. Are you sure the rollups are failing? To me, it looked like the customer had duplicates when they were trying to import their dialog data. This problem would go to Brad Carey. You have some of the dlg tables in the tmp files you copied over. He will be interested in looking at those. As far as the page checksum error: that happened on April 8th, and hasn't been repeated. It looks like either the file or the table got corrupted. I recommend dropping the table nh_dlg1b_954565199. If you can't drop it, see me; I have a new command to try for such tables. I saw that this customer forced his database consistent in February. Please check that they did a destroy/create and reload at that time, to be sure this isn't a leftover problem from that. If they did do so, run the verifytables script on this database to see what you get. (See worksheet on corrupted databases.) Putting this in moreinfo state until I hear back from you.
6/6/2000 11:43:21 AM rtrei Reassigning this problem to Brad-- The checksum error is no longer a problem. However, the dlg poller is failing to insert due to duplicates.
6/15/2000 10:26:54 AM don There is already bug #9138 addressing the duplicate key in insert bug.
5/31/2000 6:42:00 PM jnormandin Customer is running NH 4.6 P3 D02. Is experiencing Stats Rollup failures.
Statistics_Rollups.xxxxxxxl.log shows: ----- Job started by Scheduler at '05/31/2000 12:00:38 AM'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (05/31/2000 12:00:39 AM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed May 31 01:06:47 2000) ). ----- Scheduled Job ended at '05/31/2000 12:06:50 AM'. ----- errlog.log shows: TRAFFICD::[49395 , 00000ab7]: Wed May 31 07:01:29 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log TRAFFICD::[49395 , 00000d61]: Wed May 31 11:20:56 2000 E_CL1004_LK_DEADLOCK Deadlock detected TRAFFICD::[49395 , 00000d61]: Wed May 31 11:20:57 2000 E_DM9045_TABLE_DEADLOCK Deadlock encountered locking table neth.nh_stats0_959785199 in database nethealth with mode 5. Resource held by session [11464 ab7]. TRAFFICD::[49395 , 00000d61]: Wed May 31 11:20:57 2000 E_DM0042_DEADLOCK Resource deadlock. TRAFFICD::[49395 , 00000d61]: Wed May 31 11:20:57 2000 E_QE002A_DEADLOCK Deadlock detected. Output from the nhiIndexDiag command shows there are over 500 duplicate keys. Customer experienced stats rollups failing on 4/18/2000; that situation was resolved by the clean dups script.
6/2/2000 9:28:37 AM apier Checked the information obtained. Looks like stats0 duplicates. Sent the cleanStats script and instructions to roll up the DB. Awaiting results.
6/2/2000 12:20:11 PM rtrei Looks like Tony has this under control. Setting to more info until I hear closed or a problem.
6/13/2000 11:01:44 AM manthony Customer reported all is well. Closing issue.
6/2/2000 3:19:34 PM mmcnally Roll ups failing due to append to table: "Append to table nh_stats1_952664399 failed"
6/5/2000 9:50:34 AM rtrei Support is handling this problem.
Do not expect further input unless notified of unusual problems.
6/5/2000 10:38:40 AM rtrei estimate 1hA
6/21/2000 12:41:54 PM rtrei Closing; call ticket closed.
6/2/2000 3:27:32 PM mmcnally Roll ups failing due to append to table: "Append to table nh_stats1_952664399 failed"
6/2/2000 3:40:22 PM cpaschal Statistics Rollups failing with append to table error: Error: Append to table nh_stats1_955684799 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.
6/5/2000 9:48:36 AM rtrei Reviewed call log. Jose is taking appropriate actions. Do not expect further input unless Support notifies me of a problem.
6/5/2000 10:39:16 AM rtrei estimate 1hA
6/12/2000 10:50:27 AM rtrei Closing as call ticket is closed.
6/5/2000 3:12:22 PM mmcnally Richard got the following error running a database save: Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats0_958773599 () INTO 'D:/nethealth/Save/nethealth.tdb/nh_stats0_958773599'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Thu Jun 01 00:22:09 2000) ). (cdb/DuTable::saveTable) ----- Scheduled Job ended at '6/1/2000 12:22:12 AM'. ----- Had him issue the following commands: (1) sql nethealth -- should see the * prompt; (2) drop table nh_stats0_958773599;\g -- should see the * prompt again with no error. He then did a database save. When the database save completed, he got the following error: non-recoverable DMT_SHOW.
6/6/2000 9:52:05 AM manthony Asked support to recover from an archived save (destroy, create, and load). If that is not possible, requested verifydb output to determine the next step. Awaiting that info.
6/16/2000 4:00:29 PM manthony Customer reported all is well. Closing issue.
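The nh_stats table suffixes that recur throughout these tickets (e.g. nh_stats0_958773599) look like end-of-interval Unix timestamps; that reading is an inference from the values, not something the tickets state. If it holds, decoding a suffix shows which day's data a table covers, which helps when deciding which tables to drop:

```python
import time

def table_day(table_name):
    """Return (year, month, day) in UTC for a nh_stats*_<epoch> table name.

    Assumes the numeric suffix is a Unix epoch timestamp, which is an
    inference from the values seen in the tickets, not documented behavior.
    """
    epoch = int(table_name.rsplit("_", 1)[1])
    t = time.gmtime(epoch)
    return (t.tm_year, t.tm_mon, t.tm_mday)

print(table_day("nh_stats0_958773599"))  # -> (2000, 5, 19)
```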
6/16/2000 4:00:57 PM manthony xx
6/6/2000 9:13:06 AM tstachowicz The following message is written to the "Statistics_rollup.10000.log": ----- Job started by Scheduler at '22/5/2000 07:00:21'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (22/5/2000 07:00:22). Error: Unable to execute 'DROP TABLE nh_stats0_957869999' (E_US125C Deadlock detected, your single or multi-query transaction has been aborted. (Mon May 22 07:02:15 2000) ). ----- Scheduled Job ended at '22/5/2000 07:02:16'. Have collected the following information in escalated tickets voyagerii 35267: system logs, Statistics_Rollup.log, Statistics_Index..log, Data_Analysis..log, errlog.log, output of a df -k command, echo "Help\g" | sql nethealth > tables.out, output from the nhiIndexDiag command (nhiIndexDiag -u nhuser -d nethealth > indexDiag.out), and all the files in the $NH_HOME/tmp directory.
6/6/2000 10:57:09 AM manthony Rollups suffered a deadlock on a system table while in the process of dropping the stats0 tables. As a result, the rest of the stats0 tables for that day need to be dropped for rollups to succeed. Sent customer support a script to accomplish this. Awaiting fix feedback.
8/9/2000 4:57:16 PM manthony Customer reported all is well. Closing this issue.
6/6/2000 1:36:17 PM jnormandin Customer is running Nethealth version 4.6 P02 and is experiencing conversations roll-up failures. These failures have brought his drive to 100% capacity. Some drive space was freed up. From the conversations rollup log: Begin processing (06/04/2000 12:05:33 AM). Error: Append to table nh_dlg1s_959842799 failed, see the Ingres error log file for more information (E_LQ002D Association to the dbms has failed. This session should be disconnected. E_LC0030 GCA protocol service (GCA_SEND) failure with message type GCA_CDATA. Internal service status E_GCfe06 -- Write to peer process failed; it may have exited. - System communication error: Broken pipe..
). From the ingres error log: NETMGMT ::[32834 , 00000021]: Fri Jun 2 20:26:46 2000 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. NETMGMT ::[32834 , 00000021]: Fri Jun 2 20:26:46 2000 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. NETMGMT ::[32834 , 00000022]: Fri Jun 2 20:26:46 2000 E_DM93A7_BAD_FILE_PAGE_ADDR Page 8436 in table , owner: $ingres, database: nethealth, has an incorrect page number: 0. Other page fields: page_stat 00000000, page_log_address (00000000,00000000), page_tran_id (0000000000000000). Corrupted page cannot be read into the server cache. NETMGMT ::[32834 , 00000021]: Fri Jun 2 20:26:46 2000 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. These errors are not listed in Robin Trei's DB troubleshooting guide. All information gathered is on Voyagerii.
6/7/2000 10:05:49 AM rtrei They ran out of disk space. The specified table is certainly corrupt; hopefully not the whole database. Trying to track down who owns this ticket now. The following needs to be done: 1. Get a db save asap (none since May). 2. Get the info requested for a corrupt database. 3. Find space. 4. Destroy, create, and reload the database.
6/21/2000 12:43:24 PM rtrei Call ticket should be closing soon; just open to be sure no problems for a few days.
7/13/2000 6:06:36 PM rtrei Closing as call ticket closed.
6/7/2000 3:46:39 PM mmcnally When we started the NH database using the "nhStartDb" command we see the following error: > > Starting OpenIngres servers... > > ...started successfully. > > Trimming system tables in the Network Health database 'nethealth' > > Sysmoding database 'nethealth' . . . > > > > Modifying 'iiattribute' . . . > > E_US1208 Duplicate records were found. > > (Tue May 23 10:50:50 2000) > > > > Sysmod of database 'nethealth' abnormally terminated. >
6/7/2000 5:54:06 PM manthony The call ticket tells customer how to fix. This problem ticket was opened for a question about root cause.
The root cause of duplicate rows in iiattribute is multiple instances of one or more of the ingres processes running at the same time. There was a problem fixed in 4.5P10 related to this issue. If this installation ever existed prior to 4.5P10, then it is likely that this problem occurred back then and the duplicate entry has existed since then. One way to make sure that this does not occur in the future is to verify, using "ps -ef | grep ing", that no ingres processes are running BEFORE starting ingres with nhStartDb. Closing this issue.
6/8/2000 12:06:48 PM mmcnally When she ran the nhFetch database it told her she had a duplicate key. Deleted them from poller config, ran a discovery, and it didn't update elements.
6/13/2000 9:07:56 AM rtrei This is not a database issue. The problem was that the customer was polling the same item in two different databases and had problems when she tried to do remote polling. Support said they could handle it entirely, so am going ahead and closing this.
6/12/2000 10:52:08 AM jnormandin Customer is running NH vers 4.51 P11 D07. She is experiencing stats rollup failures due to duplicate keys. Statistics_Rollup_xxxx.log: Begin processing (06/05/2000 10:00:16 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Jun 5 22:07:58 2000) A df -k output of the system revealed no disk space issues. Had the customer retrieve a list of the tables with the help\g command, and also had the customer run the nhiIndexDiag command. This output revealed fewer than 100 duplicates in the DB. Consulting the troubleshooting guide, I sent her the clean dups script. * Note: all relevant files located on Voyagerii. Sent customer the cleanstats.sh script. This remedied the issue.
6/13/2000 9:09:32 AM rtrei Closing as call ticket is closed.
Support handled it entirely (and correctly).
6/12/2000 2:23:21 PM mmcnally Error: Append to table nh_stats2_956462399 failed (E_CO003F: 34 rows not copied because duplicate key detected). Last good DB save was on Monday 06/05/00.
6/13/2000 9:27:32 AM rtrei Sent mail to Shane recommending that we either drop the problem stats1 table, or reindex the stats2 to allow dups and then clean it up after rollups have occurred. (This is still caused by availability backfill.) Need to hear back from Shane regarding what he wishes to do.
6/16/2000 3:56:32 PM rtrei Closing as call ticket is closing.
6/13/2000 1:52:05 PM mmcnally Conversation roll up failure due to corrupt database.
6/13/2000 2:31:46 PM rtrei Not enough data in the call log to even know what the problem is. Awaiting more information to be put on Voyagerii.
6/15/2000 12:41:26 PM rkeville I have received the log files; they are on voyagerii.
6/21/2000 12:48:47 PM rtrei Asked Support to verify customer is at latest patch level. If problem still occurs, I need the database.
6/30/2000 12:02:17 PM rtrei I received the CD and ran it against 4.6 nhiDialogRollup. Rolled up fine. Need to investigate why it fails for the customer.
7/7/2000 3:47:53 PM rtrei The nhiDialogRollup for 4.6 did not make it into patch 3. Will talk with CM about getting it into patch 5. Meanwhile, I put the nhiDialogRollup I tested with up on our outgoing site for the customer to download.
7/27/2000 5:12:37 PM rtrei Setting to more info until I get the db.
8/22/2000 11:09:52 AM don Bob Keville has a copy of the new database on site.
8/22/2000 4:05:04 PM czarba Currently loading database onto sodium (my machine). We are expecting the load to take several hours.
8/23/2000 10:14:42 AM czarba We were able to reproduce the problem and worked around it by deleting 2 tables (nh_dlg0_960231599 and nh_dlg0_960245999).
It looks like some data from these 2 tables had already been included in a previous rollup (we're not sure why), so the new rollup attempted to create keys which already exist in the rollup table. Because of the amount of time this problem has been open, we opted for clearing it, and we'll research the cause later. The customer will lose 8 hours worth of data from June 19, 2000.
8/30/2000 8:24:33 AM don Customer is all set.
6/13/2000 2:30:45 PM squintilio The customer's rollups are failing with the following error: Job started by Scheduler at '06/13/2000 10:40:47 AM'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (06/13/2000 10:40:49 AM). Error: Append to table nh_dlg1b_959745599 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. ). ----- Scheduled Job ended at '06/13/2000 10:40:57 AM'. -----
6/14/2000 10:12:46 AM rtrei Sent to Support: Shane-- There doesn't seem to be anything in data on Voyagerii for this one yet. The first thing I would do would be to verify that they really are at patch 13. If you recall, we had a new Dialog Rollup executable which went out in patch 13. In the past, it has fixed all of these problems (except for that NT one we never heard back on). I would give better than 50-50 odds that they really aren't up to patch 13 and are running into this fixed problem. If they aren't, then we will need to get a copy of their database, as I will have to trace through the code to see what is going wrong.
6/16/2000 11:21:35 AM rtrei Customer did not send a complete database. Asked for another try.
6/21/2000 8:54:49 PM rtrei Have database. Loaded it. Confirmed problem.
6/22/2000 3:54:52 PM rtrei The rollup code seemed to work fine. The problem is that the tables do not look as they should. The first 6 tables have only 3-8 elements; the rest have 30000.
Also, it is trying to insert the rolled up first table (the three elements roll into a single roll) into a table that is in the middle of the range. In other words, tables have been rolled up after it. It looks like import data came in late, or they changed timezones, or something funny is happening. Need to get Support to ask the customer some questions. 6/30/2000 12:29:12 PM rtrei All seems well according to the customer. However, they sent in another db save so that I could load and examine for myself. (Don't know when I will get a chance to do that.) I'm still asking for the original list of help tables so that I can check the create date on those weird tables. 7/10/2000 2:12:51 PM apier "help\g" sent to Robin Trei 7/13/2000 4:28:24 PM apier Customer does not use import poller 7/13/2000 5:17:31 PM rtrei Tables were created 2-3 days after they should have been. I am consulting with Brad Carey on where to look next. 7/25/2000 2:49:36 PM rtrei Database is on voyager. 8/3/2000 4:06:44 PM rtrei This is related to 37149/9899. That is a ticket from the same group where they thought the same problem was happening. In that case, the rollups ran fine after the stack size was reset. Could these weird tables have something to do with blowing the stack at a wrong moment? 1/11/2001 12:10:53 PM don Customer running; bug closed 6/13/2000 7:35:18 PM mmcnally System crashed; ran out of disk space. Error: Append to table nh_stats1_953787599 failed 6/15/2000 4:09:34 PM yzhang Don, Here is something the customer can do to solve the problem: Use script stats1Dup.sh to remove the nh_stats1_953787599 table, then run the rollup again. This should solve the problem. The script can be found on voyagerii\Escalated Tickets\scripts. Also tell the customer that the duplicate message in the load.log is not an error. This message will appear when the database being loaded contains tables with duplicate keys. 
Yulun 6/16/2000 8:58:48 AM yzhang Bob, Here is something the customer can do to solve the problem: Use script stats1Dup.sh to remove the nh_stats1_953787599 table, then run the rollup again. This should solve the problem. The script can be found on voyagerii\Escalated Tickets\scripts. Also tell the customer that the duplicate message in the load.log is not an error. This message will appear when the database being loaded contains tables with duplicate keys. I noticed that you asked the customer to drop table nh_stats1_959331599. But this table is not in the table list. I would appreciate it if you keep me informed of the current status of the problem. Yulun 10/13/2000 2:35:04 PM yzhang Waiting for more info. 10/13/2000 2:36:40 PM yzhang Waiting for more info 12/11/2000 12:26:04 PM pkuehne Closed Pending more information...Peggy Anne Kuehne 12/11/00 6/14/2000 2:16:43 PM squintilio We keep getting the following error. We have already run the cleanStats script many times. jabba-log[8]% !4 cat Statistics_Index.100005.log ----- Job started by Scheduler at '2000/06/14 00:20:09'. ----- ----- $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (2000/06/14 00:20:09). Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Jun 14 03:20:23 2000)). ----- Scheduled Job ended at '2000/06/14 00:20:24'. ----- jabba-log[9]% 6/16/2000 12:13:17 PM rtrei Customer says dups are being created in their stats0 tables every night at 11. The little research they have done seems to indicate that it is related to a particular element. Have asked for the tables with the dups, and the nh_element table, as well as a full nhiIndexDiag output. 6/19/2000 10:12:35 AM rtrei Still awaiting output from nhiIndexDiag, plus nh_element and dup tables. 6/21/2000 8:40:17 PM rtrei I've requested copies of the tables with the dups and the nh_element table. Call Log says they are received, but I can't find them on Voyagerii. 
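The E_US1592 and E_CO003F/E_CO0048 failures throughout these tickets all trace back to duplicate (element, sample time) keys in the stats tables, which is what nhiIndexDiag reports and cleanStats removes. A minimal sketch of that duplicate-key check, assuming a hypothetical exported row layout (the real nh_stats0 column names are not shown in this log):

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Return the (element_id, sample_time) keys that occur more than once.

    `rows` is a list of dicts as they might come from an exported stats0
    table; the column names here are illustrative, not the real schema.
    """
    counts = Counter((r["element_id"], r["sample_time"]) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

# Tiny synthetic example: one element polled twice at the same sample time.
rows = [
    {"element_id": 7, "sample_time": 959403600, "octets": 120},
    {"element_id": 7, "sample_time": 959403600, "octets": 120},  # duplicate key
    {"element_id": 9, "sample_time": 959403600, "octets": 44},
]
```

A unique index on (element_id, sample_time) cannot be built while any such key is repeated, which is why the nightly Statistics Index job aborts until the duplicates are cleaned.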
6/22/2000 2:25:31 PM rtrei This looks like availability backfill is back. I'm asking for the entire database: Steph-- I've loaded the table with the duplicates and it sure does look like a variant of availability backfill. We need to find out what is going on at their site that this should be happening at all, let alone every night. Please assure them that we are taking this problem very seriously, but we will need their help in our investigation. The first thing I will need is their database. Please also tar up everything in the log and tmp directories as well. I believe the customer has said that this happens every night after 11:00. Have they noticed any other events that might be related? I would also like an ls -l of the $NH_HOME\bin\sys directory. 1. Did they notice this before applying patch 4 (and patch 4 did not make it better) or did it start appearing on patch 4? 2. Do they stop the poller and restart it at night? Once they send us the database, they can go ahead and clean up. If they could just ftp us the tables that get duplicates as they occur, I think that should be sufficient and we would appreciate it. If it happens with the regularity and frequency that the customer mentions, we should get this under control for them quickly. The cleanStats.sh script should work on this, and work more easily than what they are doing. Please help them get that going. 6/27/2000 1:56:58 PM snorman Sent Robin more debug 6/28/2000 6:10:12 PM rtrei Steph-- How many tar files make up the complete set? No point in my looking at it until I get them all. Also, some of the files are 0 blocks, so I think there is a problem somewhere down the line... Let's try this: They are missing a lot of polls, and I suspect this might be related to how the duplicates are getting in. Have them do the following: 1. Run nhCleanStats to clean up their duplicates and get rollups going again so they can stop worrying about disk space. 2. 
Once everything is rolled up, turn on advanced logging for the poller. Run for 4-5 polls. Ship that data + the poller.cfg to me. (I'll have D.S. check for poller problems and whether anything can be tuned.) 3. Did this problem occur before the patch, or just start once the patch was applied? 4. See if they can set up an ftp site on their machine that I could ftp to and try to pull the entire database. I've had that happen before, where the customer could not get the file up here before the timeout, but I could pull it from their system. 6/30/2000 5:21:14 PM rtrei The files tarred up were just a few stats tables with duplicates. I really need the full database. This problem will go over to Tony next week, and hopefully he can coax a real database from this customer next week. Without that, the best I can offer is to follow the instructions I've already sent. 7/21/2000 9:55:33 AM apier DB is here. 7/21/2000 11:45:20 AM apier DB is on \\voyagerii\tickets\35000\35693 7/21/2000 12:32:29 PM rtrei Chris-- Come see me about this. This needs to be loaded into the 4.6 noway system. 7/21/2000 1:18:53 PM rtrei Reassigning back to me. Chris on vacation next week. 7/21/2000 2:37:20 PM rtrei Tried to load the database. It is not a complete save. I don't have a complete listing of what is missing, but it is at least missing the nvr_b23 file, which determines the schema, as well as all the stats tables I would need to look at. (I listed the tar before I extracted it; both the listing and the extraction worked properly, so I don't think this is a tar file corruption problem, I think it is a tar file creation problem.) Setting status back to MoreInfo until I have a good database. I would really like to get this database as we are anxious to find out what is going on. 7/25/2000 3:11:14 PM rtrei Support is trying to get another database. 7/27/2000 7:17:13 PM jpoblete Got Customer DB, it's located on: //voyagerii/Escalated Tickets/35000/35693/save.tdb 7/31/2000 11:18:44 AM rtrei Loaded database. 
Looking into it. 7/31/2000 11:31:49 AM rtrei Oops, thought I had loaded it. That was for a different ticket. Need to get the tar for this one moved to Solaris. 8/3/2000 3:18:57 PM rtrei Loading into /export/noway2/46TA as db_35693 8/9/2000 10:36:33 AM rtrei Duplicate data is being collected to be passed on to Dave Shepard 8/10/2000 2:42:03 PM rtrei Reassigning to Dave Shepard. His analysis is the next step. 8/10/2000 4:42:15 PM don Rollups are still failing at this customer site 8/14/2000 6:04:28 PM jpoblete The Rollups are not failing, but it seems that the Statistics Index job is having problems. I asked the customer to send the log of that process. Customer is not down. They are polling at a 15 minute rate. They said Statistics Index is having problems (I'm awaiting the log). They note the DB is steadily growing (the rollups are not failing). Here is the Statistics Rollup Log: jabba-log[9]% cat Statistics_Rollup.100000.log ----- Job started by Scheduler at '2000/08/14 05:00:11'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (2000/08/14 05:00:13). End processing (2000/08/14 05:19:21). ----- Scheduled Job ended at '2000/08/14 05:19:21'. ----- Below is the whole history ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here are some excerpts from conversation with the customer: -----Original Message----- From: Poblete, Jose Sent: Thursday, August 10, 2000 5:16 PM To: 'nethealth@net.gov.bc.ca'; 'Colin.Kopp@gems3.gov.bc.ca' Subject: Concord Call Ticket 35693 ..... We're wondering if the rollups are now successful, and if, after doing some cleanup on the poller configuration, you experienced fewer missed polls. 
-----Original Message----- From: Kopp, Colin ISTA:EX [mailto:Colin.Kopp@gems3.gov.bc.ca] Sent: Monday, August 14, 2000 11:03 AM To: 'Poblete, Jose' Cc: 'nethealth@net.gov.bc.ca' Subject: RE: Concord Call Ticket 35693 No - the problem is still occurring, so I guess the answer to your second question is yes, we are still missing polls (at the 15 minute interval no less). -----Original Message----- From: Poblete, Jose Sent: Monday, August 14, 2000 12:00 PM To: 'Kopp, Colin ISTA:EX'; Poblete, Jose Cc: 'nethealth@net.gov.bc.ca' Subject: RE: Concord Call Ticket 35693 I'm still uncertain if the Statistics Rollups are successful or not. Have you done some cleanup and maintenance of your poller configuration, removing old elements and updating the ones which bring SNMP errors? How many good polls do you currently have? How many bad polls? Would you please send us a copy of your Statistics Rollup Log, located in the $NH_HOME/log directory? -----Original Message----- From: Kopp, Colin ISTA:EX [mailto:Colin.Kopp@gems3.gov.bc.ca] Sent: Monday, August 14, 2000 5:39 PM To: 'Poblete, Jose' Cc: 'nethealth@net.gov.bc.ca' Subject: RE: Concord Call Ticket 35693 Jose, We no longer poll at the 5 minute rate. We went back to the 15 minute rate about a month ago hoping that the duplicate problem would go away; however, the problem of duplicates is still happening (22 yesterday). It is the Statistics_Index job that is failing - the Statistics_Rollup job is not complaining at this time (it did originally), however it is obvious that things aren't working correctly as disk space gets steadily consumed. On our NH 4.6 machine (jabba) we did do some cleaning up. Currently good polls are around 15,100, bad polls 450. On our NH 4.1.5 machine (borgo), which polls the same devices, more or less, the good polls are around 15,400, bad polls 1,200 - with the key point being this older release does not suffer the same duplicate problem. Here is a copy of the log (we had since cleared up the duplicates).... 
jabba-log[9]% cat Statistics_Rollup.100000.log ----- Job started by Scheduler at '2000/08/14 05:00:11'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (2000/08/14 05:00:13). End processing (2000/08/14 05:19:21). ----- Scheduled Job ended at '2000/08/14 05:19:21'. ----- -----Original Message----- From: Poblete, Jose Sent: Monday, August 14, 2000 5:46 PM To: 'Kopp, Colin ISTA:EX' Subject: RE: Concord Call Ticket 35693 / 9405 Would you please send me a copy of your Statistics Index Log? Under normal conditions the DB will grow gradually, but the Statistics Rollups limit its growth rate. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 8/14/2000 6:40:06 PM jpoblete Here is the Statistics Index Log: -----Original Message----- From: Kopp, Colin ISTA:EX [mailto:Colin.Kopp@gems3.gov.bc.ca] Sent: Monday, August 14, 2000 6:18 PM To: 'Poblete, Jose' Cc: 'nethealth@net.gov.bc.ca' Subject: RE: Concord Call Ticket 35693 / 9405 Here is the latest Statistics_Index log. What I find interesting here is that there is no 'End processing' statement. Hmmmm....a clue??? jabba-log[10]% cat Statistics_Index.100005.log ----- Job started by Scheduler at '2000/08/14 14:20:14'. ----- ----- $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (2000/08/14 14:20:14). ----- Scheduled Job ended at '2000/08/14 14:20:25'. ----- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ I have now asked the customer for the output of nhiIndexDiag -u nhUser -d DB_Name. I'll have it soon. 8/15/2000 12:59:50 PM jpoblete Here is the nhiIndexDiag command output: jabba-nwh[4]% nethealth;/usr/nh/bin/sys/nhiIndexDiag -u nwh -d nethealth Table is lacking an index. Duplicate problem: Found 0 duplicates out of 77 rows for index job_schedule_ix on table nh_job_schedule. Analysis of indexes on database 'nethealth' for user 'nwh' completed successfully. Table is lacking an index. 
Duplicate problem: Found 0 duplicates out of 77 rows for index job_schedule_ix on table nh_job_schedule. Analysis of indexes on database 'nethealth' for user 'nwh' completed successfully. jabba-nwh[4]% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If more info is needed, let me know. 8/15/2000 7:46:13 PM dshepard There were an awful lot of duplicates in this database. Similar to 9911, the duplicates appear to correspond to approximate times that configuration changes were made. In this case, several scheduled discover jobs coincided with the duplicates. I'll be investigating that lead tomorrow if we can get a good build of the catfish branch. 8/31/2000 12:43:31 PM don Don, The duplicate issue is still a problem for the Government of British Columbia in version 4.6. They are polling around 20,000 elements on two machines right now (essentially they are the service provider for the province of BC). One machine is running 4.1.5, the other machine is running 4.6. The goal is to unplug the 4.1.5 machine and put the 4.6 machine in production. The only thing that is preventing them going production with 4.6 is the duplicates-in-the-database issue. They run a script to purge the database of duplicates, and can't go to a 15 minute poll because then the duplicates get much worse (so they are forced to use a 30 minute poll). Our contact is Colin Kopp; he has used Network Health for years and is very sharp. Feel free to contact him to get more specific details on this issue. His email is: Colin.Kopp@gems3.gov.bc.ca 9/5/2000 4:06:37 PM dshepard Requested info from the user's environment as they have a consistent and frequent case of this problem. Data requested is: 1) Dump of duplicate data in stats0 tables 2) system log 3) $NH_HOME/log and $NH_HOME/tmp directory trees 4) poller.cfg file I am hoping to correlate a specific instance of duplicate records with a specific config change. 
9/12/2000 4:56:39 PM dshepard I've reached the limit of what I can figure out with just looking at database records. I need to get some more information from the customers that are experiencing this problem. It appears to me that there is some correlation with config changes. My hope is that this exercise will shed some light on the root cause of the problem and allow us to fix it. I have built a new version of the poller based on the latest NH 4.6 patch (patch 5). This version will dump poller checkpoints when it sees what looks like a duplicate record being written to the database. Here is what I'd like the customers to do: 1) Upgrade to the latest 4.6 patch. 2) Delete any poller checkpoint files in $NH_HOME/tmp/pollerHistory.out 3) Install the new poller executable from the Concord ftp site. I have placed executables for both Solaris and HP as follows: /ftp/outgoing/nhiPoller.sol.dups /ftp/outgoing/nhiPoller.hp.dups 4) Stop and restart the servers 5) Check back once or twice a day for a new file at $NH_HOME/tmp/pollerHistory.out If you see that file, then the poller has detected what it thinks may be a duplicate record. At that point, do the following: 1) Get the listdups script from the Concord ftp site in /ftp/outgoing 2) To run this script, the user must log in as nhuser, cd to NH_HOME, source their nethealthrc.csh file, and then execute the 'listdups' script. Once that runs, send the following information to Concord tech support. 1) a tar'ed or archived copy of all the files and directory structures in $NH_HOME/tmp and $NH_HOME/log 2) a saved system log 3) $NH_HOME/poller.cfg file Just FYI, the reasons for needing the above information are as follows: I wish to correlate the instances of duplicate data with config changes. That requires the system log to show when config changes occurred, the output of listdups to show when the duplicate data got inserted, and then the tmp and log directories to show what config changes occurred to which elements. 
Then I hope to use the poller.cfg file and the checkpoint dump from the poller to trace the control paths that led to the duplicate data for specific elements. 9/20/2000 7:19:12 PM jpoblete Dave, I have collected the information requested, it is in the call ticket directory: 39535 9/21/2000 6:30:33 PM dshepard Looking thru the data. 9/26/2000 5:13:39 PM dshepard Either the customer didn't run the listdups script properly, or there were no actual dup records created. However, the info they did send points to the former. Have them run listdups again and see if it points out any duplicate records. If so, have them send the resulting data files. Otherwise I'd appreciate getting an nhExportData run for the data on MoffettCore-81-RH-Cpu-0 for the period on 9/19/00 from 7 PM to 8 PM. This was useful information. I'm still working on it, but this extra information would be very helpful. 10/2/2000 3:36:06 PM jpoblete Got the requested information, sent to Dave. 10/3/2000 6:00:08 PM dshepard I have placed a new Poller executable on the outgoing website that I believe will fix the duplicate data problems in the stats0 tables. Will wait to hear results. Email sent to responsible parties in tech support. 10/6/2000 10:52:23 AM rsanginario Tuesday, October 03, 2000 6:00:08 PM dshepard I have placed a new Poller executable on the outgoing website that I believe will fix the duplicate data problems in the stats0 tables. Will wait to hear results. Email sent to responsible parties in tech support. 11/6/2000 2:09:33 PM bhinkel Changed to new Field Test state until customer approves fix. 11/6/2000 4:27:03 PM dshepard This fix has been in Field Test for 5 weeks with no repeat of the problem. I have three days left in order to get it approved for the next patch. Otherwise it has to wait over a month longer. My suggestion is therefore to put it into the next patch. There are too many customers seeing the problem to delay it another patch cycle. 
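The investigation above hinges on correlating duplicate-record timestamps with configuration changes (e.g., scheduled discovers) found in the system log. A rough sketch of that correlation step, assuming both event streams have already been reduced to Unix timestamps; the 15-minute window is an arbitrary choice for illustration, not a documented value:

```python
def correlate(dup_times, config_change_times, window=900):
    """Map each duplicate-record timestamp to the config changes that
    occurred within `window` seconds of it.

    All timestamps are Unix epoch seconds. Duplicates with no nearby
    config change are omitted from the result.
    """
    hits = {}
    for d in dup_times:
        near = [c for c in config_change_times if abs(d - c) <= window]
        if near:
            hits[d] = near
    return hits
```

Feeding it the listdups insert times and the config-change times from the system log would show whether every duplicate sits close to a discover or reconfigure event, which is the pattern the field-test poller build was meant to confirm.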
11/7/2000 10:44:12 AM bhinkel Changed to Field Test until Tribunal reviews on 11/09 and fix gets checked in. 11/10/2000 5:25:45 PM dshepard Combined with ticket 9234 6/20/2000 3:56:08 PM tcordes Documentation indicates that an upgrade from 4.1 to 4.6 is feasible. Simon Ravenscroft at Logical has twice seen the poller.cfg fail to be converted in an upgrade. Here is what he did: - After loading a 4.1.5 database onto 4.6P3 (which took over 2 hours and showed no errors in the load.log)... I checked the poller config and found that the speeds were showing up under the Index column and the names were showing up under the agent types columns... i.e., the columns were all shifted left one place. - When running a trend report for bandwidth utilisation, the content of the chart was garbage and the GUI core dumped after producing the report. We need to fix this. As more customers will be converting 4.1.5 to 4.6/4.7, this has the potential to cause huge issues for customers and support. Image of the poller.cfg GUI, load.log, and poller.cfg.log are on Voyager. 6/21/2000 2:04:46 PM rtrei Yulun-- Please look into this. Part of the conversion code does convert the poller.cfg file. Look in the nhConvertDB.sh script. (I can help take you through it as well.) Two people that it might be useful to talk with are Rob Lindberg and Dave Shepard. I expect this will need to go out in a patch. 6/22/2000 2:37:25 PM yzhang Sheldon, This problem was assigned to me, and after some initial investigation, I realized that I need the customer's 4.1.5 database, so that I can load it to 4.6 and see if I can reproduce the problem. I would very much appreciate it if I could have the database as soon as possible. Thanks Yulun (ext. 4524) 6/23/2000 9:55:10 AM manthony This is a poller cfg issue; DB convert worked fine. Assigning to Mgr of that group. 7/10/2000 11:30:31 AM dshepard Looks like this has made the loop. This is a DB issue, as the Plr Cfg UI uses the DB to display its info in 4.6. 
The original problem report was that the nh_element table is messed up. This has nothing to do with the poller.cfg file. I saw no updates to the ticket that suggested the focus of the problem was elsewhere. Hence it's going back to the DB group. 7/12/2000 1:17:05 PM rtrei Dave-- From the messages, it looks like Mike A thought that the problem happened in nhiPlrCfgCvrt. And I strongly suspect this as well. (Though it might be related to nhiPlrCfgCvrt being called multiple times.) Who owns nhiPlrCfgCvrt? We thought the poller group did. But, given how busy you and Brett are, I will be glad to have the db team locate the problem in nhiPlrCfgCvrt in more detail. We may need to get help from you or Brett to help understand the code, though. We will keep you informed of our results. Yulun-- I am reassigning to you. Come talk with me before you proceed. We may want to debug it together. Robin 7/12/2000 3:09:48 PM yzhang Sheldon, This problem was reassigned to me; can you send me the 4.1.5 database, or any 4.1 db if you don't have 4.1.5, so that I can reproduce the problem. It's better if you send the database for Unix. Thanks Yulun 12/11/2000 12:26:40 PM pkuehne Closed Pending more information...Peggy Anne Kuehne 12/11/00 6/28/2000 1:41:55 PM wburke Begin processing (06/23/2000 08:00:58 AM). Error: Sql Error occurred during operation (E_QE007C Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) (Fri Jun 23 08:01:24 2000) FROM THE ERRLOG.LOG: C-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_dlg0_960479999, Page 4103. NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. 
NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM9261_DM1B_GET Error occurred getting a record. NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM904C_ERROR_GETTING_RECORD Error getting a record from database:nethealth, owner:neth, table:nh_dlg0_960479999. NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_DM008A_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) NMC-2 ::[38280 , 00003383]: Fri Jun 23 08:01:24 2000 E_QE007C_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) 6/29/2000 10:45:02 AM rtrei Shane-- This looks like we have a corrupt database. Before we do the standard destroy/create/reload deal though, let's try the following: verifydb -mreport -otable nh_dlg0_960479999 -sdbname nethealth and send me back the $II_SYSTEM/ingres/files/iivdb.log file. Then we will probably do a verifydb -mrun -odrop_table nh_dlg0_960479999 -sdbname nethealth 7/19/2000 9:47:43 AM apier No current failures. Customer was on vacation for 2 weeks and has seen no failure since he returned. Call Closed 6/28/2000 7:14:12 PM cpaschal Customer has upgraded to 4.6 P3. Rollups were successful for 2 weeks after, then began failing: Error: Append to table nh_stats1_960328799 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.). (dbu/DuTable::appendTable) 6/29/2000 10:37:32 AM rtrei Dave-- The append to table stats1 failure that this customer has is not fixed until 4.7. 
It looks like you are getting all the right information to solve it via the worksheets. Let me know if you need additional help; otherwise I will assume you have this under control. Robin 7/10/2000 1:05:23 PM rtrei evaluated. 7/10/2000 1:05:52 PM rtrei closed as call ticket closed 7/5/2000 12:56:09 PM tcordes Per procedure in the Database Troubleshooting Guide, had customer run the cleanStats script to clean duplicates as reported in nhiIndexDiag output. cleanStats reports the following cleaned: nh_stats0_960627599 nh_stats0_960631199 nh_stats0_960926399 nh_stats0_960980399 nh_stats1_960699599 However, a second sweep with nhiIndexDiag indicates: Table is lacking an index. Duplicate problem: Found 0 duplicates out of 338 rows for index job_schedule_ix on table nh_job_schedule. Problem with Index . Error: Index 'nh_stats1_957416399_ix2' had different keys than expected. Problem with Index . Error: Index 'nh_stats1_960699599_ix1' was not not in the database. Duplicate problem: Found 0 duplicates out of 0 rows for index nh_stats1_960699599_ix1 on table nh_stats1_960699599. Problem with Index . Error: Index 'nh_stats1_960699599_ix2' was not not in the database. Duplicate problem: Found 0 duplicates out of 0 rows for index nh_stats1_960699599_ix2 on table nh_stats1_960699599. Analysis of indexes on database 'nethealth' for user 'neth' completed successfully. Per procedure, logging bug. 7/6/2000 12:24:52 PM rtrei I am surprised that the cleanStats script did not index the stats1 table. But there are a few versions floating around now, so this might be an older version. It did clean up the dups. The other messages in the nhiIndexDiag are not serious-- at some previous time a script was run that did not index those tables correctly. However, the stats0 will be dropped soon so it isn't worth worrying about, and the nh_schedule one is just that the index was named differently. The nh_stats1_960699599 table is probably indexed by now. If not, run the nhiIndexDb program. 
7/25/2000 2:55:48 PM rtrei At this point, this is being watched carefully as a possible case where missed polls caused the duplicates. Tony is working on getting the db tuned. I am putting this into MoreInfo until I hear back from Tony regarding the status, in particular whether new duplicates have occurred. 8/16/2000 8:15:08 AM apier We have a copy of the database on VoyagerII. Need to get advanced logging so that we can tune the poller. 8/24/2000 2:44:27 PM apier Poller tuning complete. Call Closed 7/6/2000 12:24:29 PM foconnor Customer is experiencing conversation rollup failures with the error "Append to table nh_dlg1s_962337599 failed, see the Ingres error log file" Conversation Rollup Log Job started by Scheduler at '07/05/2000 09:15:57 AM'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (07/05/2000 09:16:03 AM). Error: Append to table nh_dlg1s_962337599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 17 rows not copied because duplicate key detected.). ----- Scheduled Job ended at '07/05/2000 09:16:17 AM'. Customer: NH 4.5.1 Patch 13 D09 Files and database can be found: //voyagerii/tickets/36000/36807 7/10/2000 1:04:42 PM rtrei Loaded this and ran nhiDialogRollup. Failure point is at the same point I saw 2 weeks ago: there are several very tiny nh_dlg0 tables at the start of the list, and the table has already been rolled up. Asked Jose to get the creation date of the tables from the customer. Will also be talking to Brad about this. Customer workaround is to drop the tables. I've given the list to Jose, and he should be able to work on getting the customer back up again. 7/10/2000 2:22:04 PM rtrei Help output was not as complete as I would have liked, but it did confirm that the weird tables were created after the fact. (They were created on Jun 2 and held data for May 30). 
7/27/2000 5:14:35 PM rtrei closing as call ticket closed 7/6/2000 6:15:58 PM mmcnally Getting SQL errors emailed to her after a database fetch. SQL Error: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Jun 28 12:45:37 2000) SQL Error: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Jun 28 12:45:40 2000) This was fixed with the cleanStats script. Bugged due to this happening on NH 4.6 P03 D03 8/30/2000 10:39:03 AM rtrei Requesting database. Need a new database if the problem is still occurring (before it is cleaned up with cleanStats), or a copy of an old database from before the problem was fixed, in order to do anything. 9/5/2000 12:03:00 PM rkeville -----Original Message----- From: Monica.Marie.Thompson@census.gov [mailto:Monica.Marie.Thompson@census.gov] Sent: Tuesday, September 05, 2000 11:23 AM To: Keville, Bob Subject: Re: Concord Escalated Call Ticket - 36681 - E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. Bob, Is this a really old ticket? I don't remember this. Why don't we just close it for now. I am working on duplicates with Jose Poblete and Tony Piergallini. Thanks, Bob, Monica ############################################################### 7/7/2000 10:54:40 AM jnormandin Customer machine specs: NT 4.0 running NH vers 4.6 P02 D02. Customer's statistics rollups are failing. From system messages: Wednesday, June 28, 2000 10:35:38 Starting job 'Statistics Rollup' . . . (Job id: 100000, Process id: 426). Wednesday, June 28, 2000 10:39:01 Job step 'Statistics Rollup' failed (the error output was written to D:/nethealth/log/Statistics_Rollup.100000.log Job id: 100000). Wednesday, June 28, 2000 10:39:01 Job 'Statistics Rollup' finished (Job id: 100000, Process id: 426) From Statistics rollup log: Begin processing (6/28/2000 10:35:38). Error: Append to table nh_stats1_959486399 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 
0 rows successfully copied because either all rows were duplicates or there was a disk full problem Verified current disk space: 6 GB free. Consulted Robin Trei DB troubleshooting guide. Info to gather: list of all DB tables, MIN and MAX sample time for table nhStats1_959486399. MIN TIME: 959400663 nh_stats0_959403599 nethealth table nh_stats0_959403599_ix1 nethealth index nh_stats0_959403599_ix2 nethealth index nh_stats0_959407199 nethealth table nh_stats0_959407199_ix1 nethealth index nh_stats0_959407199_ix2 nethealth index nh_stats0_959410799 nethealth table nh_stats0_959410799_ix1 nethealth index nh_stats0_959410799_ix2 nethealth index nh_sta< ts0_959414399 nethealth table nh_stats0_959414399_ix1 nethealth index nh_stats0_959414399_ix2 nethealth index nh_stats0_959417999 nethealth table nh_stats0_959417999_ix1 nethealth index nh_stats0_959417999_ix2 nethealth index nh_stats0_959421599 nethealth table nh_stats0_959421599_ix1 nethealth index nh_stats0_959421599_ix2 nethealth index nh_stats0_959425199 nethealth table nh_stats0_959425199_ix1 nethealth index nh_stats0_959425199_ix2 nethealth index nh_stats0_959428799 nethealth table nh_stats0_959428799_ix1 nethealth index nh_stats0_959428799_ix2 nethealth index nh_stats0_959432399 nethealth table nh_stats0_959432399_ix1 nethealth index nh_stats0_959432399_ix2 nethealth index nh_stats0_959435999 nethealth table nh_stats0_959435999_ix1 nethealth index nh_stats0_959435999_ix2 nethealth index nh_stats0_959439599 nethealth table nh_stats0_959439599_ix1 nethealth index nh_stats0_959439599_ix2 nethealth index nh_stats0_959443199 nethealth table nh_stats0_959443199_ix1 nethealth index nh_stats0_959443199_ix2 nethealth index nh_stats0_959446799 nethealth table nh_stats0_959446799_ix1 nethealth index nh_stats0_959446799_ix2 nethealth index nh_stats0_959450399 nethealth table nh_stats0_959450399_ix1 nethealth index nh_stats0_959450399_ix2 nethealth index nh_stats0_959453999 nethealth table nh_stats0_959453999_ix1 nethealth 
index nh_stats0_959453999_ix2 nethealth index
nh_stats0_959457599 nethealth table  nh_stats0_959457599_ix1 nethealth index  nh_stats0_959457599_ix2 nethealth index
nh_stats0_959461199 nethealth table  nh_stats0_959461199_ix1 nethealth index  nh_stats0_959461199_ix2 nethealth index
nh_stats0_959464799 nethealth table  nh_stats0_959464799_ix1 nethealth index  nh_stats0_959464799_ix2 nethealth index
nh_stats0_959468399 nethealth table  nh_stats0_959468399_ix1 nethealth index  nh_stats0_959468399_ix2 nethealth index
nh_stats0_959471999 nethealth table  nh_stats0_959471999_ix1 nethealth index  nh_stats0_959471999_ix2 nethealth index
nh_stats0_959475599 nethealth table  nh_stats0_959475599_ix1 nethealth index  nh_stats0_959475599_ix2 nethealth index
nh_stats0_959479199 nethealth table  nh_stats0_959479199_ix1 nethealth index  nh_stats0_959479199_ix2 nethealth index
nh_stats0_959482799 nethealth table  nh_stats0_959482799_ix1 nethealth index  nh_stats0_959482799_ix2 nethealth index

MAX TIME: 959486173

Difference = 69 tables.
whattime 959400663: Sat May 27 00:11:03 2000
whattime 959486173: Sat May 27 23:56:13 2000
More than 24 tables between them, and the times do not match 00:59:00 and 23:59:59. Ticket will be escalated per the DB troubleshooting doc.

7/10/2000 1:06:31 PM rtrei Sent mail to Support.

7/10/2000 2:05:39 PM rtrei This is the standard stats1 duplicate problem. The Support person counted indices as well as regular tables and used sample_time instead of range time. I will update the worksheets to avoid this confusion in future. Meanwhile, told Support to go ahead and drop the stats1 table and the rollup should go through fine. I do not expect to work on this again unless I hear that an unexpected problem occurred.

7/12/2000 12:40:30 PM rtrei Just waiting to hear back from Support on status.

7/25/2000 3:24:13 PM rtrei Closing as call ticket closed.

7/7/2000 2:54:14 PM jnormandin Customer machine info: Solaris 2.6, NH vers 4.6 P03 D02. Customer has had recurring rollup failures.
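The table-count diagnosis above trips people up in two ways: the `help` output mixes tables with their `_ix1`/`_ix2` index entries (69 catalog entries is only 23 tables), and the epoch stamps must be converted to wall-clock time to see whether the range covers more than 24 hourly buckets. A minimal sketch of both checks, assuming the naming convention shown above and a fixed UTC-4 (EDT) offset for the conversion:

```python
# Sketch of the worksheet check: count real nh_stats0 tables (skipping the
# _ix1/_ix2 index entries) and convert NH epoch stamps to local time.
# The EDT offset and the catalog-entry format are assumptions from the log.
from datetime import datetime, timezone, timedelta

def whattime(epoch, utc_offset_hours=-4):
    """Convert an epoch stamp to local time (EDT assumed here)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch, tz)

def count_tables(catalog_entries):
    """Count tables only; index entries must not be counted."""
    return sum(1 for name, kind in catalog_entries
               if kind == "table" and not name.endswith(("_ix1", "_ix2")))

# Two hourly buckets with their indexes, as a `help` listing reports them:
entries = [
    ("nh_stats0_959403599", "table"),
    ("nh_stats0_959403599_ix1", "index"),
    ("nh_stats0_959403599_ix2", "index"),
    ("nh_stats0_959407199", "table"),
    ("nh_stats0_959407199_ix1", "index"),
    ("nh_stats0_959407199_ix2", "index"),
]
print(count_tables(entries))  # 2 tables, not 6 catalog entries
print(whattime(959400663).strftime("%a %b %d %H:%M:%S %Y"))
```

Dividing the 69 catalog entries by 3 gives the 23 real hourly tables, which is within the expected one-day window rather than beyond it.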
(Tickets 34483, 35124 and 36822.) Installed P03 when it became available but they are still experiencing them.

From system messages:
Monday, July 03, 2000 07:00:06 PM Starting job 'Statistics Rollup' . . . (Job id: 100000, Process id: 2587).
Monday, July 03, 2000 07:04:44 PM A scheduled poll was missed, the next poll will occur now (Statistics Poller).
Monday, July 03, 2000 07:06:50 PM Job step 'Statistics Rollup' failed (the error output was written to /opt/nethealth/log/Statistics_Rollup.100000.log Job id: 100000).
Monday, July 03, 2000 07:06:50 PM Job 'Statistics Rollup' finished (Job id: 100000, Process id: 2587).

Statistics Rollup log: Begin processing (07/03/2000 07:00:07 PM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Jul 3 22:06:46 2000) ).

Output from nhiIndexDiag:
Problem with Index. Error: Index 'nh_stats0_961775999_ix1' was not in the database. Duplicate problem: Found 18 duplicates out of 47725 rows for index nh_stats0_961775999_ix1 on table nh_stats0_961775999. Saving duplicates in '/opt/nethealth/tmp/session.nh_stats0_961775999.dat'.
Problem with Index. Error: Index 'nh_stats0_961775999_ix2' was not in the database. Duplicate problem: Found 18 duplicates out of 47725 rows for index nh_stats0_961775999_ix2 on table nh_stats0_961775999. Saving duplicates in '/opt/nethealth/tmp/session.nh_stats0_961775999.dat'.

Customer's Ingres error log contains multiple QEF errors from April 6 to Jun 22, but no errors listed on July 3 (the day of the rollup failure). Since there were only 36 duplicates listed, I had them run the cleanStats script. This was successful and the rollups were also successful. Customer is concerned about the recurrence of these duplicates as well as the inability of patch 3 to prevent them in this situation.

7/10/2000 1:07:28 PM rtrei Awaiting database.
Tony says the problem stopped happening after they adjusted their system to not miss polls. He will work with them on tuning it. Meanwhile, Engineering needs to look into what is causing this.

7/13/2000 6:32:30 PM rtrei Asking Support if the database has arrived.

7/25/2000 3:45:16 PM rtrei Call log says they are still trying to get the database.

9/11/2000 10:29:08 AM apier When I last spoke to Dave Shepard he mentioned that he had at least 2 DBs from customers experiencing the duplicate problems. He thought another DB was not going to be very helpful and that he needed to examine the code. I have not collected the DB. Tony

9/14/2000 11:21:38 AM rtrei This is waiting pending hearing more from Dave Shepard as to what he wants to do with these.

10/27/2000 8:49:10 AM apier No response from the customer for 30 days after sending the poller executable provided by Dave Shepard. Closing the call.

7/17/2000 5:37:31 PM jpoblete Customer: Morgan Stanley. Cust. Sensitivity: YES. Problem: Customer is trying to roll up his DB, but the process is hanging; it has been running several hours without significant CPU utilization. They were having the error:
NH_HOME/bin/sys/nhiRollupDb -now 4/14/00
Begin processing (07/17/00 22:34:57). Error: Unable to execute 'DELETE FROM nh_stats0_955238399 WHERE sample_time <= 955205999' (E_US1264 The query has been aborted. (Mon Jul 17 10:47:44 2000)).
From Remedy, we first attempted to stop all Network Health proc's, then stop Ingres, then resized the transaction log and tried the rollup again, with little success. Initially they were experiencing this when issuing the normal rollup, nhiRollupDb. Then we started the rollups using the -now option and tried weekly rollups, which worked and freed up some space (went from 98% to 80% partition space used). Then it failed with the same error; we tried to perform daily rollups, but now it hangs on the rollup for April 12th.
We need to get the customer's DB rolled up before upgrading him to Network Health 4.1.5.

7/17/2000 5:39:07 PM jpoblete Pd. The Ingres errlog.log only shows a lot of messages like the following: ::[3451 , 00AD6040]: Mon Jul 17 15:53:37 2000 W_DM5422_IIDBDB_NOT_JOURNALED WARNING: The iidbdb is being opened but journaling is not enabled; this is not recommended.

7/18/2000 11:47:38 AM jpoblete Finally, the rollup for the day 04/12/00 failed: -----Original Message----- From: Binh Ho [mailto:Binh.Ho@msdw.com] Sent: Tuesday, July 18, 2000 9:36 AM To: support@concord.com; jopbele@concord.com; Nicholas R Hook Subject: case # 37190 Jose, nhiRollupDb failed for 04/12/2000 Thanks Binh ebems2 /u/nethealth/current 73# $NH_HOME/bin/sys/nhiRollupDb -now 4/12/00 Begin processing (07/18/00 02:30:05). Error: Unable to execute 'DELETE FROM nh_stats0_955238399 WHERE sample_time <= 955205999' (E_US1264 The query has been aborted. (Mon Jul 17 16:26:07 2000) ).

7/18/2000 12:58:41 PM rtrei Jose-- I still think they are blowing their transaction log. (Their errlog.log shows they blew their transaction log several times since last Friday and Sunday.) It can take a lot of time to roll back when a transaction log is blown, so I suspect that is what is making the poller appear to hang. My recommendation is to increase the transaction log to 1 GB or better as space allows. An alternate recommendation is to delete the nh_stats0_955205999 and nh_stats0_955238399 tables and see how far rollups go before having problems again. (It could be that the above table is unusually large for some reason or other.) Make sure you delete them properly and remove them from the nh_rllp_boundary table as well. Assigning status to MoreInfo until we hear back how the recommendation worked out.

7/21/2000 12:41:26 PM don Tables were deleted and rollups are running successfully.

7/19/2000 4:03:41 PM rrick $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (07/19/2000 08:05:38).
Error: Append to table nh_dlg1b_959065199 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 445 rows not copied because duplicate key detected. ). ----- Scheduled Job ended at '07/19/2000 08:05:43'. ----- Version 4.6 D03 P03

7/20/2000 2:53:10 PM rtrei NH 4.6 P3 should have contained a new nhiDialogRollup, but it missed getting into the patch. (It will get into patch 5.) I have put a Solaris version of this out on the ftp outgoing site as nhiDialogRollup.Z. If this does not solve the problem, then we will need to get the database in-house so that I can debug with it.

7/21/2000 2:56:14 PM rtrei Setting this to MoreInfo until I hear from Support regarding results.

7/27/2000 9:53:53 AM don De-escalated because the customer will not be able to install the patch for 2 weeks.

12/13/2000 10:53:11 AM pkuehne Closed pending more information...Peggy Anne Kuehne 12/13/00

7/24/2000 10:23:40 PM drecchion Rollups failing with "Schedule poll was missed next poll will occur" error in the system log. Per Tony P, Robin Trei is waiting for a database failing rollups with this error message. Database, syslog, error log and rollup log are in escalated tickets on voyagerII 37000/37460.

7/27/2000 9:50:51 AM rtrei Loaded the database, started to investigate. Currently, I estimate this will take about 3d as there is a lot of investigation to do, and results are unknown.

8/3/2000 2:42:49 PM jpoblete Robin, the customer reached 100% disk utilization today. Asked them to stop the Nethealth proc's; could not connect to the nethealth DB successfully. Ingres is in a different partition than Nethealth.

8/4/2000 5:31:39 PM rtrei Talked with Jose. Because of the wording of the original ticket, both Support and I were confused about what the other was doing. We did not coordinate our next steps. At this point, Jose is getting the customer back up and running. He is fixing the duplicates, getting the DB to roll up, and resetting the poller timer so it won't miss polls.
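The failure mode running through these tickets is the same: rows whose key columns collide defeat a unique-index build (E_US1592) or a COPY append (E_CO003F), and tools like nhiIndexDiag and cleanStats exist to find and purge those collisions. A minimal sketch of the underlying check, using SQLite and illustrative table/column names rather than the actual NH schema:

```python
# Sketch of a duplicate-key scan: group by the would-be index key and
# report groups with more than one row. Schema names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nh_stats0_demo (element_id INTEGER, sample_time INTEGER, value REAL)"
)
rows = [(1, 100, 5.0), (1, 100, 7.0),   # duplicate key (1, 100)
        (2, 100, 3.0), (1, 160, 4.0)]
conn.executemany("INSERT INTO nh_stats0_demo VALUES (?, ?, ?)", rows)

# These are the rows that would make a unique index build fail:
dups = conn.execute("""
    SELECT element_id, sample_time, COUNT(*) AS n
    FROM nh_stats0_demo
    GROUP BY element_id, sample_time
    HAVING n > 1
""").fetchall()
print(dups)  # [(1, 100, 2)]
```

A cleanup script then keeps one row per colliding key (or archives the extras, as nhiIndexDiag does with its session .dat files) before retrying the index build.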
Then we need to monitor the situation to be sure no other duplicates appear. Beyond that, we shouldn't need anything more from this customer. I have loaded the database and will pass it on to, or work with, Dave Shepard to investigate it.

8/4/2000 7:35:33 PM jpoblete -----Original Message----- From: Poblete, Jose Sent: Friday, August 04, 2000 7:25 PM To: Trei, Robin Cc: Gray, Don; Piergallini, Anthony Subject: RE: 9911 / 37460 Robin, I ran the cleanStats script and then the Rollup finished OK. Then I moved the moved files back to their original location and attributes, and started the Nethealth proc's without problem. But before starting the Nethealth proc's I ran the sysmod command and got the following: tconoc09% sysmod nethealth Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . E_US1200 Table name is not valid. (Fri Aug 4 19:17:52 2000) Sysmod of database 'nethealth' abnormally terminated. I believe the DB is corrupted and the customer should save/destroy/create/reload the DB and maybe install your Ingres patch. Please let me know if I'm correct. JMP

8/7/2000 12:27:47 PM rtrei Yes, you are correct. As we discussed over the phone, using symbolic links to move a database is not a good idea. Good thing the customer started with this.

8/9/2000 10:37:14 AM rtrei This duplicate data has been collected and passed on to Dave Shepard.

8/10/2000 2:43:06 PM rtrei Reassigning to Dave Shepard as his analysis is the next step.

9/5/2000 4:07:04 PM dshepard Requested more info on this problem from the customer with ticket 9405.

9/12/2000 4:56:18 PM dshepard I've reached the limit of what I can figure out by just looking at database records. I need to get some more information from the customers that are experiencing this problem. It appears to me that there is some correlation with config changes. My hope is that this exercise will shed some light on the root cause of the problem and allow us to fix it.
I have built a new version of the poller based on the latest NH 4.6 patch (patch 5). This version will dump poller checkpoints when it sees what looks like a duplicate record being written to the database. Here is what I'd like the customers to do:
1) Upgrade to the latest 4.6 patch.
2) Delete any poller checkpoint files in $NH_HOME/tmp/pollerHistory.out
3) Install the new poller executable from the Concord ftp site. I have placed executables for both Solaris and HP as follows: /ftp/outgoing/nhiPoller.sol.dups /ftp/outgoing/nhiPoller.hp.dups
4) Stop and restart the servers
5) Check back once or twice a day for a new file at $NH_HOME/tmp/pollerHistory.out
If you see that file, then the poller has detected what it thinks may be a duplicate record. At that point, do the following:
1) Get the listdups script from the Concord ftp site in /ftp/outgoing
2) To run this script, the user must log in as nhuser, cd to NH_HOME, source their nethealthrc.csh file, and then execute the 'listdups' script.
Once that runs, send the following information to Concord tech support:
1) a tar'ed or archived copy of all the files and directory structures in $NH_HOME/tmp and $NH_HOME/log
2) a saved system log
3) the $NH_HOME/poller.cfg file
Just FYI, the reasons for needing the above information are as follows: I wish to correlate the instances of duplicate data with config changes. That requires the system log to show when config changes occurred, the output of listdups to show when the duplicate data got inserted, and then the tmp and log directories to show what config changes occurred to which elements. Then I hope to use the poller.cfg file and the checkpoint dump from the poller to trace the control paths that led to the duplicate data for specific elements.

10/3/2000 5:59:47 PM dshepard I have placed a new Poller executable on the outgoing website that I believe will fix the duplicate data problems in the stats0 tables. Will wait to hear results.
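The correlation exercise described above reduces to lining up two timestamp streams: when duplicate rows were inserted (from listdups) and when config changes occurred (from the system log). A toy sketch of that join, with assumed epoch lists and an assumed one-hour attribution window:

```python
# Toy correlation of duplicate-insert times against config-change times.
# The timestamp values and the 1-hour window are illustrative assumptions.
def near_config_change(dup_times, config_times, window=3600):
    """Return duplicate timestamps occurring within `window` seconds
    after some config change."""
    return [t for t in dup_times
            if any(0 <= t - c <= window for c in config_times)]

config_times = [959400000, 959470000]          # assumed config-change epochs
dup_times = [959401200, 959440000, 959471000]  # assumed duplicate-insert epochs
print(near_config_change(dup_times, config_times))  # [959401200, 959471000]
```

Duplicates that cluster inside the window support the config-change hypothesis; ones that do not (like 959440000 here) point at another control path.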
Email sent to responsible parties in tech support.

11/6/2000 2:09:57 PM bhinkel Changed to new Field Test state until customer approves fix.

11/6/2000 4:27:14 PM dshepard This fix has been in Field Test for 5 weeks with no repeat of the problem. I have three days left in order to get it approved for the next patch. Otherwise it has to wait over a month longer. My suggestion is therefore to put it into the next patch. There are too many customers seeing the problem to delay it another patch cycle.

11/7/2000 10:44:44 AM bhinkel Changed to Field Test until Tribunal reviews on 11/09 and fix gets checked in.

11/10/2000 5:26:27 PM dshepard Duplicate of 9234.

7/25/2000 1:18:39 PM foconnor Customer has experienced statistics rollup failures, statistics index failures, and the sysmod.log shows that the iiattribute table is corrupt. Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . E_QE008D Error trying to modify core system catalogs: iirelation, iiattribute, iiindexes, ii_relidx, iidevices. (Mon Jul 10 03:59:05 2000) Sysmod of database 'nethealth' abnormally terminated. Files: //voyagerii/tickets/37000/37182/July25 Network Health 4.5.1 Patch 10 D07 Solaris 2.6

8/3/2000 3:21:43 PM rtrei The associated call ticket is closed. The data was stored away for review and any additional analysis.

7/28/2000 2:03:58 PM tcordes This problem was seen by Professional Services in the Training Lab. If the ingres user tries to run the nhMvCkpLocation script without sourcing nethealthrc.csh, this error ensues: ED-SER53% ./nhMvCkpLocation -p /training/nh46/db/chkpnt -n nethealth_ckp nethealth Creating checkpoint location nethealth_ckp . . .
Unable to create directory /training/nh46/db/chkpnt/ingres/ckp/default/nethealth Unable to create directory /training/nh46/db/chkpnt/ingres/dmp/default/nethealth Unable to create directory structure for checkpointing. Per ahamlin, scripts in the $NH_HOME/bin/ dir should source the nethealthrc.sh file; the user should not have to.

7/31/2000 3:06:38 PM rtrei This should be in your bailiwick now :>

1/17/2001 10:31:10 AM wzingher Setting target release 5.5, status to assigned.

3/1/2001 8:18:23 PM rtrei Reassigning to Yulun for the 5.5 release. Estimate 1 day.

4/20/2001 8:48:44 AM pkuehne Changed Assigned Priority to medium.

5/3/2001 4:24:35 PM pkuehne Declined: works as specified. Part of the 5.5 Bug Triage. Peggy Anne Kuehne

8/1/2000 5:18:14 PM snorman Standard error for duplicate keys in the conversations database. Asked for the error log to show the exact table. DB available. 4.5.1 P13

8/1/2000 5:22:36 PM mjc Customer is ISIS.

8/2/2000 9:42:05 AM rtrei Steph-- I looked over on ftp.incoming and saw an isis_db_300700.tar. I grabbed that and it loaded OK. Looked complete. I presume it was the entirety of the 5 split tars. The load of the database looked good; I ran nhiDialogRollup-- it completed without problems. It did take almost 6 hours, so presumably something is going on at the customer site, but I need more info to determine what. Please get the following:
1. The actual error message when nhiDialogRollup is failing. If it isn't in the log, see if they can run it manually once in order to capture it.
2. ls -l $NH_HOME/bin/sys > NH_files.out
3. echo "help\g" | sql nethealth > nh_tables.out
4. Run the enclosed script: nhCollectCustData.sh. It will create a file in $TMPDIR/dbCollect.tar; please get that for me as well.
As usual, please have them accomplish the above by logging in as nhuser, cd'ing to $NH_HOME, and sourcing nethealthrc.csh. thanks, Robin

8/2/2000 2:54:04 PM rtrei Rollup ran fine here on a Solaris system.
Double-checked the size and date of nhiDialogRollup on their system against that of our patch-- they match. Double-checked the files in their database vs. the files in the database I loaded (except for a few new tables created after the database was saved, they match). Nothing in the errlog.log to indicate problems. Next step is to grab an HP system and see if it is something HP-specific.

8/3/2000 9:51:41 AM rtrei Reassigning to Yulun for the following:
1. Make sure the HP system has the latest 4.6 patch installed.
2. Log into our ftp site and get isis_db_300700.tar.gz
3. Load it onto the database.
4. Run bin/sys/nhiDialogRollup -u nhuser -d nethealth
Report back whether it runs to completion or not. (This succeeded on Solaris; we are now checking if there is something specific to HP that is causing this problem for the customer.)

8/8/2000 4:12:51 PM yzhang Sorry for taking so long on this problem; the database load and nhiDialogRollup run combined take about 12 hours. Our test of running nhiDialogRollup on HP looks fine, and I am doing the same test again today to double-check. The result will come out tomorrow morning. Yulun

8/9/2000 10:21:55 AM yzhang Stephanie, we used the new nhiDialogRollup, and it ran OK on HP. Can you have the customer grab this executable from ~ftp/outgoing on ftp.concord.com and run nhiDialogRollup again. Also please remind the customer to unlimit the stack size in their nethealth source file. Yulun

8/9/2000 10:44:15 AM yzhang This is the history of how we took care of this problem. 1) We installed nh45, then patch 13, on one of the HP machines. 2) We loaded database 300700.tdb from ~ftp/incoming on the ftp.concord.com site; the load succeeded. Then we ran nhiDialogRollup -u nhuser -d nethealth and got the following error: Error: Append to table nh_dlg1s_963784799 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 975 rows not copied because duplicate key detected.
3) Robin Trei had somebody rebuild nhiDialogRollup, which presumably included some of the changes. 4) We loaded the database again and ran the new nhiDialogRollup; the run succeeded. To double-check, we did the DB load and nhiDialogRollup run twice; it succeeded each time. 5) Now we have put the new executable nhiDialogRollup in ~ftp/outgoing, and the customer can grab it and run it on their HP system.

8/10/2000 9:33:11 AM rsanginario Updated this to fixed. Fixed in field test.

8/13/2000 6:40:16 PM yzhang Stephanie, to double-check that we used the right executable for nhiDialogRollup, I have put an executable called good_nhiDialogRollup.Z on our ftp site. Can you have the customer grab this executable from ~ftp/outgoing on ftp.concord.com and run nhiDialogRollup again. Also please remind the customer to unlimit the stack size in their nethealth source file, and make sure their number of bytes transferred is correct during the ftp process. Thanks Yulun

8/14/2000 9:00:01 AM yzhang I reloaded the database on HP; the load succeeded. Then I went to the fx_hp view, got the new nhiDialogRollup (built Aug. 4), and reran it; again it succeeded. The good_nhiDialogRollup.Z is now in our ftp/outgoing directory. I have told Stephanie to have the customer grab it and rerun.

8/16/2000 10:59:59 AM yzhang The new nhiDialogRollup has been built with print information. Now I am reloading the customer database to test this new nhiDialogRollup on the HP machine. I will put the new nhiDialogRollup on our ftp/outgoing site after a successful test. Yulun

8/18/2000 5:36:44 PM yzhang Yulun and Michael have a dialog rollup running on the customer DB. We think we have the problem resolved and the rollup has made it far past the point of failure. We will let the rollups complete over the weekend, and on Monday, assuming that they ran OK, Stephanie can ship the customer the attached script. The script will clean the DB and allow rollups to complete. Note that the rollups will take a long time to run, ~6-12 hours depending on the machine.
To run the script:
(1) Disable dialog rollups through the job scheduler UI.
(2) As the NH admin, source the nethealthrc.csh file.
(3) ./esc9966.sh
(4) cd $NH_HOME/bin/sys
(5) ./nhiDialogRollup -u -d
When rollups complete they can be turned back on through the scheduler.

8/19/2000 4:18:39 PM yzhang Stephanie, the rollups completed successfully. You can ship the script to the customer and have them run it according to the above.

8/29/2000 9:50:11 AM yzhang The rollup succeeded, and the ticket is closed.

8/2/2000 11:02:24 AM rkeville Database status GUI shows all zeros intermittently. - NH 4.6 P03/D04 - NT 4.0 - Dual PIII 400's. - 1 GB RAM. - 1 GB SWAP. nhDbStatus works fine, but if you try to get a DB status from the GUI sometimes it shows all zeros. They can't duplicate this at will. Reported by Rob Jarvis.

###########################################################

8/16/2000 6:11:49 PM rkeville Appears to have been resolved after removing a huge number of node addr pairs from the database.

##########################################################

8/2/2000 11:14:27 AM rrick $NH_HOME/bin/sys/nhiSaveDb -u $NH_USER -d $NH_RDBMS_NAME -ckp ----- Mon Jun 12 02:28:49 2000 CPP: Preparing to checkpoint database: nethealth Mon Jun 12 02:28:49 2000 CPP: Preparing stall of database, active xact cnt: 0 Mon Jun 12 02:28:49 2000 CPP: Finished stall of database Mon Jun 12 02:28:51 2000 CPP: Deleting non-database file: path = '/opt/nethealth/idb/ingres/data/default/nethealth', file = 'zzzz0026.ali' beginning checkpoint to disk /opt/nethealth/db/save/checkpoint/ingres/ckp/default/nethealth of 1 locations Mon Jun 12 02:28:51 2000 CPP: Start checkpoint of location: ii_database to disk: path = '/opt/nethealth/db/save/checkpoint/ingres/ckp/default/nethealth' file = 'c0027001.ckp' executing checkpoint to disk /bin/sh: /bin/cp: The parameter list is too long. Mon Jun 12 02:28:52 2000 E_DM1101_CPP_WRITE_ERROR Error writing checkpoint.
Mon Jun 12 02:28:52 2000 E_DM110B_CPP_FAILED Error occurred checkpointing the database. Begin processing (06/12/2000 01:28:28 AM). Copying relevant files (06/12/2000 01:28:32 AM). End processing (06/12/2000 01:28:52 AM). -----

8/3/2000 3:36:10 PM rtrei Yulun-- This isn't an escalated ticket, but it should be worked as a fairly high priority. Try to spend 2-3 hours on it next week. Mike A will work with you on it.

8/8/2000 1:11:30 PM yzhang Studying the problem.

8/10/2000 10:30:08 AM yzhang Waiting for the rollup.log from the customer.

8/16/2000 3:14:04 PM yzhang Russell, I looked at the escalated folder for this ticket, and I noticed some information is still missing. I think I need: 1) a list of tables in the database, and the log file for the stats rollup. Thanks Yulun

8/17/2000 10:59:21 AM yzhang Asked Russell to see if he can get the customer database so I can do the test on our HP system for the checkpoint saving.

8/17/2000 5:55:32 PM yzhang Russell, attached is the cktmpl.def file. Please have the customer do the following: 1) replace $NH_HOME/idb/ingres/files/cktmpl.def with this one; 2) change the owner of the file to ingres; 3) run nhSaveDb -ckp nethealth. Also please mention to the customer that nhMvCkpLocation only needs to be run once. They have already created the location, so they can run nhSaveDb -ckp nethealth directly after replacing the file.

8/18/2000 10:56:29 AM rsanginario CHANGED STATUS TO MOREINFO

8/22/2000 10:49:32 AM yzhang Their checkpoint save succeeds, but copying files failed.
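The checkpoint failure above (`/bin/cp: The parameter list is too long`, leading to E_DM1101) is the classic case of a shell glob expanding to more arguments than a single exec() allows on files from a large (>2 GB) database. The generic workaround, independent of the checkpoint template fix the ticket settles on, is to invoke the copy in bounded batches; a sketch, with `cp` usage and the batch size as illustrative assumptions:

```python
# Sketch: split a huge argument list into bounded batches so that each
# external command invocation stays under the OS argument limit.
# The batch size of 500 is an illustrative assumption, not a real limit.
def batches(args, max_args=500):
    """Yield successive slices of at most max_args arguments."""
    for i in range(0, len(args), max_args):
        yield args[i:i + max_args]

files = [f"file{i:05d}.ali" for i in range(1234)]
print([len(b) for b in batches(files)])  # [500, 500, 234]

# Real use would look roughly like (not executed here):
#   for b in batches(files):
#       subprocess.run(["cp", *b, dest_dir], check=True)
```

Tools like `xargs` do the same batching automatically, which is why `find ... | xargs cp -t dest` survives directories that break `cp *`.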
Asked the customer to change the permission on the ckp.tdn directory, then do the checkpoint save again.

8/24/2000 10:47:08 AM rrick 08/24/00 Yulun, changing permissions and setting the system into the C shell corrected their issue. We can check in the fix to allow a customer to carry a larger than 2 gig database.

8/28/2000 11:17:10 AM rsanginario Changed status to fixed, field test.

8/8/2000 2:19:53 PM snorman She was on 4.1 and she then went to 4.1.5. Now we are trying to get her to 4.6, and the database is now converting. She receives the following: Converting database nethealth Segmentation Fault Converting a prior version 4.1 database . . . Begin processing Copying new fields into the Database (08/08/2000 01:58:43 PM). Segmentation Fault The database nethealth has not been converted. You will not be able to run Network Health with this database. Logs on voyager in the escalated tickets directory.

8/9/2000 10:14:10 AM rtrei Yulun--
1. Talk with Steph Norman and see if she needs help getting the customer up and running. I recommend that we try 4.5 for this customer and see if that works.
2. Review the logs and see if you can pinpoint where the problem occurred.
3. Try to duplicate the problem in the lab.
I believe you have another (lower priority) 4.1.5 to 4.6 conversion problem assigned to you already. If so, you should look at that at the same time.

8/9/2000 2:17:38 PM yzhang It is not very clear to me what exactly the customer is doing. My understanding is that the customer wants to convert a 4.1.5 database to 4.6 and got a Segmentation Fault. Can you have the customer try the conversion from 4.5 to 4.6.
Let me know if you need help doing this. Yulun

8/10/2000 10:38:52 AM yzhang Waiting for the customer's 4.1.5 database.

8/10/2000 11:59:07 AM jpoblete Yulun, we have the DB and the DB load fails, but no error is in the load.log. Tell us where you need it and we will send you the tar file.

8/10/2000 12:15:40 PM jpoblete -----Original Message----- From: Poblete, Jose Sent: Thursday, August 10, 2000 12:04 PM To: Zhang, Yulun Cc: Norman, Stephanie; Gray, Don Subject: Escalated Ticket 10012 / 37419 Importance: High Yulun, The customer DB is located in ftp://ftp.concord.com/outgoing/37419.tdb.tar We attempted to load the DB on one of our Solaris servers; the load failed, however, there was NO error message in load.log, only in the system log: "Database Load Failed, But polling can continue" After that, the Nethealth proc's fall into a loop of stopping unexpectedly and restarting. Please take a look at this since this is an escalated, customer-sensitive issue. Thank You, J. M. Martinez Poblete, Senior Support Engineer

8/10/2000 5:54:57 PM yzhang Reproduced the customer problem; now debugging to find the fix.

8/11/2000 12:20:19 PM yzhang It looks like the problem is coming from polling. This is the error message from dbx; I am wondering if it's appropriate for you to look at this problem. signal SEGV (no mapping at the fault address) in CuTknStream::findDelim at line 1270 in file "cuTokenizer.C" License : Re-connected to the license server for Sun WorkShop dbx SPARC after 1 retries signal SEGV (Segmentation Fault) in _libc_kill at 0xef60828c 0xef60828c: _libc_kill+0x0008: bgeu _libc_kill+0x30 Current function is CuTknStream::findDelim (dbx) The customer DB is located in ftp://ftp.concord.com/outgoing/37419.tdb.tar. Thanks Yulun

8/15/2000 6:01:22 PM yzhang Stephanie, with Michael's help, we have found that some nmskeys in poller.cfg have a large number of characters as the key length, which causes the segmentation fault. We need a script for cleaning the nmskeys in the poller.cfg file.
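The cleanup script Yulun asks for above amounts to scanning poller.cfg for nmskey entries whose value exceeds whatever fixed-size buffer the tokenizer assumes. A sketch of that scan; the "name value" line format and the 64-character limit are illustrative assumptions, not the real poller.cfg grammar:

```python
# Sketch: flag config entries whose value is suspiciously long, the kind
# of oversized nmskey that crashed CuTknStream::findDelim. The line format
# and length limit here are assumptions for illustration.
def oversized_keys(lines, max_len=64):
    """Return the values of 'nmskey' entries longer than max_len chars."""
    bad = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "nmskey" and len(parts[1]) > max_len:
            bad.append(parts[1])
    return bad

cfg = [
    "nmskey " + "A" * 10,
    "nmskey " + "B" * 200,   # long enough to overrun a fixed-size buffer
    "pollrate 300",
]
print(len(oversized_keys(cfg)))  # 1
```

A cleaning pass would then drop or truncate the flagged entries before the poller parses the file; the real fix, of course, is bounds-checking in the tokenizer itself.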
Does anyone know where we can get the script? This problem has happened in the past. Thanks Yulun

8/17/2000 10:54:42 AM don -----Original Message----- From: Felicia Artis [mailto:Felicia.Artis@harbinger.com] Sent: Wednesday, August 16, 2000 12:49 PM To: 'Gray, Don' Cc: Norman, Stephanie Subject: RE: Ticket 37419 Don, Thank you for getting us back on-line again. I was able to load the database with this poller.cfg file. I still should stress that we reported this problem 2 weeks ago, and have been completely down ever since then. Thanks, Felicia Felicia Artis Systems Administrator

8/17/2000 10:57:09 AM yzhang It's OK now after cleaning the poller.cfg file.

8/11/2000 7:46:22 PM drecchion Rollups are failing due to duplicates. This ticket is being bugged and escalated due to the fact that this customer is presently running patch 3. nhIndexDiag shows multiple duplicates. All relevant files are in the escalated tickets directory on voyagerII 38000/38161

8/30/2000 10:43:20 AM rtrei Russell-- Please check with the customer whether the problem is still occurring. If yes, please get a copy of the database before running the cleanStats script. If no, see if you can get an old copy of the database with the duplicates still present. We can't do any analysis without the data. thanks, Robin

12/13/2000 11:00:27 AM pkuehne Closed pending more information. Peggy Anne Kuehne 12/13/00

8/15/2000 5:33:20 PM jnormandin Statistics Rollup failure. From Statistic_rollup.log: --- Job started by Scheduler at '08/07/2000 20:30:45'. $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME Begin processing (08/07/2000 20:30:47). Error: Append to table nh_stats1_965087999 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.
df -k output (NH and DB are on the /opt filesystem): Filesystem kbytes used avail capacity Mounted on /dev/dsk/c0t0d0s0 464695 108716 309510 26% / /dev/dsk/c0t0d0s1 464695 404883 13343 97% /usr /proc 0 0 0 0% /proc fd 0 0 0 0% /dev/fd /dev/dsk/c0t0d0s7 2107063 1585008 458844 78% /export /dev/dsk/c1t5d0s0 52128848 8652840 42954720 17% /opt swap 1015136 40 1015096 1% /tmp
The nhiIndexDiag command revealed no duplicates. db status output: nipr-stats% nhDbStatus nethealth Database Name: nethealth Database Size: 2468954000.00 bytes RDBMS Version: OI 2.0/9712 (su4.us5/00) Location Name Free Space Path +-------------------+------------------+---------------------------------+ | ii_database | 42954895000.00 bytes | /opt/ingres | +-------------------+------------------+---------------------------------+ Statistics Data: Number of Elements: 5285 Database Size: 2434592000.00 bytes Location(s): ii_database Latest Entry: 08/08/2000 18:08:06 Earliest Entry: 10/02/1999 00:00:00 Last Roll up: 08/07/2000 20:39:03 - errlog.log shows no entries made for that day - the verifydb command lists no problems - searched all *.log files in the $II_SYSTEM/ingres/files directory - no entries matching the 8-7 date stamp
8/18/2000 7:49:11 PM apier Rollups are now working. Duplicates cleaned up. Stats1 duplicate problem.
8/22/2000 5:05:22 PM rkeville Scheduled jobs were lost after save and load. - They have an old save that should have the scheduled jobs in it. ####################################################
8/23/2000 11:01:20 AM czarba Jobs actually seem to be running - which means that the schedules are intact, but not showing up in the GUI. Mike A. suggested bouncing the server, which cleared up the problem.
8/24/2000 1:26:57 PM mmcnally BUG VERSION: NH 4.6 P00 D00 O/S: NT SHORT DESCRIPTION: Command line nhSaveDb will delete all files in the drive due to a small syntax error DETAILED DESCRIPTION: Customer ran nhSaveDb from the command line and it deleted all the files in his C: drive.
During a database save when the target directory does not exist, the nhSaveDb command erases all files and directories at the upper level of the path (e.g., if C:\nethealth-save is the expected but non-existent directory, all the information at the C:\ level is deleted!). The customer's command syntax was wrong. Documentation is correct. Customer is aware that he created the problem with bad syntax, and ticket 36935 was closed on this issue. This problem needs to be bugged because a small command line syntax problem should not cause the major problem of erasing all the files in a directory. Issue raised by Alain Prevoteau aprevoteau@concord.com on behalf of Renato Vista (reseller ARCHE in France)
8/30/2000 10:28:41 AM rtrei Yulun-- I know Mike A did some work on this problem. We need to determine whether 1. The problem was fixed in a later patch and the customer should just apply the patch. 2. The problem was fixed in a later version, and we should merge (or move) the fix to this version. 3. The fix still needs to be tweaked.
8/31/2000 1:26:38 PM yzhang Mike, This problem was assigned to me, but I could not reproduce it. I ran nhSaveDb on the target_directory (which does not exist), and the save succeeded, and nothing got erased. I did this on 4.6 on NT. Do you think I missed something? Thanks
8/31/2000 2:34:16 PM yzhang Yulun, This problem was fixed in the 4.7.1 release. Thanks, -Mike
9/12/2000 5:55:15 PM yzhang The code change has been checked in.
8/28/2000 4:25:28 PM dkrauss Conversations rollup failing with 'Append to table nh_dlg1b_...' error ERROR: Begin processing (8/25/2000 12:55:28 PM). Error: Append to table nh_dlg1b_950936399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 153 rows not copied because duplicate key detected. ). The DBTroubleshooting doc specifies - If the customer is on 4.5, then we need to get them a new executable.
(The new nhiDialogRollup did not go out in the patch planned for it; it is awaiting a later patch. Until that time, we need to deliver a new executable.) Have Dave Andrews build a new cdb library and nhiDialogRollup in the correct build view for the customer's platform (fx_hp, fx_sol, fx_NT). Customer is on NT4.0 SP5, NH 4.5.1 p11 d08. Files received from customer: Conversations Rollup log, errlog.log, iiacp.log, iircp.log, system log. All files located on //voyagerii/Escalated Tickets/38000/38872
9/7/2000 10:00:05 AM rtrei Sent new executable to support. Waiting to hear results.
9/19/2000 3:55:08 PM jpoblete Robin, Customer tried the new executable, but the conversation rollup still fails: ----- Job started by Scheduler at '9/19/2000 12:50:57 PM'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (9/19/2000 12:50:58 PM). Error: Append to table nh_dlg1b_950936399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 153 rows not copied because duplicate key detected.). ----- Scheduled Job ended at '9/19/2000 12:51:09 PM'. -----
9/19/2000 3:55:56 PM jpoblete Oops, forgot to update the status of this to Assigned.
9/20/2000 10:01:35 AM rtrei Asked for the db and the create date information
10/11/2000 7:44:11 PM jpoblete Customer never got back to support; closed the call ticket as not resolved. Up to engineering to close this one.
12/13/2000 11:00:59 AM pkuehne Closed pending more information...Peggy Anne Kuehne 12/13/00
9/5/2000 1:39:55 PM wburke ________________________________________________ SUMMARY: Customer has logged a trouble ticket against the following rollup failures, once a month for the last three. He would appreciate a stronger fix to his problem, as this happens consistently once a month. "We've **already fixed** it, but we want to have a more robust rollup, which doesn't start failing every couple of weeks." 36590 6/27/00 Error: Append to table nh_stats1_960328799 failed.
36981 7/10/00 Error: Unable to execute 'DROP TABLE nh_stats0_962909999' 39152 9/01/00 Error: Append to table nh_stats1_966895199 failed
9/7/2000 10:02:24 AM rtrei Chris: This should be a good learning opportunity :> See me for recent history in this area. -- RT
11/22/2000 1:13:55 PM foconnor -----Original Message----- From: ICS Product Support [mailto:support@ics.de] Sent: Wednesday, November 22, 2000 11:59 AM To: O'Connor, Farrell Subject: Re: Call ticket 39152 ICS TICKET ID:001303 (Request 002745) Farrell, > Thank you for that information. There was a Problem Ticket associated with > that ticket (10290); can I close that also? Yes, please do so. I guess we will not be able to reproduce this properly. Jan
9/6/2000 10:35:05 AM schapman ver. 4.5 Patch 11 The error message "Error: Sql Error occured during operation (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error" occurred during a database save. All pertinent information has been collected. As per the verifydb output, I am trying to drop the indices or the tables and then try to save, destroy, create, and load. As per our discussion this may be a candidate for the Ingres patch. All files are contained on voyagerii\tickets\39000\39079
9/7/2000 10:05:22 AM rtrei This is a trigger for a discussion with Support and Engineering management about putting the Ingres 2.0 patch out for 4.5 and 4.6. If we just put the patch out for customers to download with written instructions, the work is minimal. If we want to wrapper it (to make sure II_PATCH_LEVEL gets set), I estimate one week for the CM or DB team, plus QA test time.
9/14/2000 11:40:24 AM rtrei Closing as call ticket closed.
9/8/2000 11:41:40 AM foconnor Seeing stack dump messages in their errlog.log file. Customer was experiencing Statistic rollup failures due to append-to-stats2 errors; dropped the stats2 table. Ran rollup: Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 2 max_range: 960692399.
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Aug 30 16:24:22 2000) Dropped the rlp boundary table and ran the rollup: melr% nhiRollupDb Begin processing (05/09/2000 15:54:50). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Sep 5 15:11:10 2000) ) Ran cleanStats, ran rollups: same error. Ran cleanStats again, ran rollups: same error. Requested errlog.log again, found stack dump errors. Excerpt from errlog.log: Stack dmp name 41966 pid 29327 session 0: EDB5FF90: CSMT_setup(00000001,00BCE060,00AEB620,00000000,00000000,00000000) Stack dmp name 41966 pid 29327 session 0: EDB61B68: EF656518(00604AE0,00004000,00BCE150,00BCE060,0005023C,EF666BC8) Stack dmp name 41966 pid 29327 session 0: EDB61D78: CSMT_setup(00BCE060,00000000,00000000,00000000,00000000,00000000) MELR ::[41966 , 0000a1ad]: Segmentation Violation (SIGSEGV) opv_parser(0x13bc60) @ PC 13be4c SP edb5f4b8 PSR fe001006 G1 0 o0 0 MELR ::[41966 , 0000a1ad]: Tue Aug 29 14:21:18 2000 E_OP0901_UNKNOWN_EXCEPTION I have received these files or output from these commands: infodb nethealth > infodb.out logdump > trans_log.out verifydb -odbms_catalogs -mreport -sdbname nethealth All the logs in $II_SYSTEM/ingres/files/*.log The Nethealth system messages. Files in //voyagerii/tickets/38000/38575/Central_melr/Stack_Dump
9/11/2000 9:23:41 AM rtrei Chris-- This needs to be written up on the CA web site.
9/25/2000 12:39:25 PM czarba Updated CA web site with additional info requested by CA tech support. Sent logs to pcs.cai.com
10/17/2000 4:23:16 PM czarba CA recommends doubling the values for the following parameters: opf_memory, stack_size, qsf_memory. vch-compression should be off.
10/18/2000 1:22:35 PM czarba Notified Farrell
10/18/2000 3:03:38 PM czarba Tech support would like this kept open until the customer verifies this fixes the problem.
More likely these recommended changes will reduce the likelihood of another crash rather than prevent crashes altogether.
10/26/2000 1:36:09 PM foconnor Customer emails and rollups are running again
10/26/2000 1:50:13 PM czarba CS reports customer rollups working now.
9/8/2000 4:28:32 PM snorman From: Cai, Qi Chi (Exchange) [mailto:qchicai@bear.com] Sent: Friday, September 08, 2000 4:03 PM To: 'Norman, Stephanie' Subject: t39254, attn S.Norman Hi, Stephanie, here is what I did (the error messages are exactly the same to me). Please let me know if you have received this email. nethealthta-fi% mv nhiDialogRollup New_rollup nethealthta-fi% ls -l New_rollup -rwxr-xr-x 1 health comms 6756704 Sep 8 15:39 New_rollup nethealthta-fi% ./New_rollup -u health -d nethealth Begin processing (09/08/2000 03:57:03 PM). Error: Append to table nh_dlg1s_967521599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 226 rows not copied because duplicate key detected. ). nethealthta-fi% pwd /tmp nethealthta-fi% nethealthta-fi% ls -l $NH_HOME/log/Conv*log -rw-r--r-- 1 health comms 437 Sep 8 05:16 /opt/health/log/Conversations_Rollup.100001.log nethealthta-fi% cat $NH_HOME/log/Conv*log ----- Job started by Scheduler at '09/08/2000 05:16:02 AM'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (09/08/2000 05:16:09 AM). Error: Append to table nh_dlg1s_967521599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 226 rows not copied because duplicate key detected. ). ----- Scheduled Job ended at '09/08/2000 05:16:25 AM'.
9/11/2000 9:26:15 AM rtrei Yulun-- 1. Get a copy of the database. 2. Review the database worksheets and see if the problem matches any of the situations described there. 3. Double-check that the nhiDialogRollup we sent them had the cdb library recompiled so that they would get the changes. 4. If worst comes to worst, drop tables.
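The E_CO003F failures quoted above come down to rows that collide on the key columns of a uniquely indexed table. A minimal stand-in for the situation, using a text file in place of the table and the first two fields as the key (both the sample data and the key layout are invented for this sketch, not the real schema):

```shell
# Rows that collide on the key fields (here fields 1 and 2) are what a bulk
# COPY into a uniquely indexed table rejects; sort -u on those fields keeps
# one row per key. Sample rows and key choice are invented for illustration.
cat > /tmp/dlg_rows.txt <<'EOF'
1001 964000000 valueA
1001 964000000 valueB
1002 964000000 valueC
EOF
sort -u -k1,2 /tmp/dlg_rows.txt > /tmp/dlg_rows.dedup.txt
wc -l < /tmp/dlg_rows.dedup.txt   # 2: one of the two colliding rows is gone
```

Which of the colliding rows survives is up to sort; in the real tables the cleanStats-style scripts make that choice deliberately.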
9/11/2000 11:35:09 AM yzhang Stephanie, I rebuilt nhiDialogRollup after building the cdbLib. The new nhiDialogRollup, named nhiDialogRollup_new (built from nh45 p14), is ready in the ~ftp/outgoing directory. Make sure the customer grabs nhiDialogRollup_new from the site, then runs nhiDialogRollup_new. If they still have a problem with this, please have the customer place their database in the ~ftp/incoming directory and let me know. Thanks, Yulun
9/14/2000 12:04:57 PM snorman .
9/14/2000 3:39:19 PM yzhang Please have the customer run the attached script in the following steps: 1) cd $NH_HOME, source the nethealthrc.csh 2) copy the attached script into $NH_HOME/fix10350.sh 3) run the script 4) run the nhiDialogRollup_new (the one I placed on the ftp/outgoing site) as nhiDialogRollup_new -u $NH_USER -d $NH_RDBMS_NAME I tested this script on the customer database and it rolls up fine. Let me know the result. Thanks, Yulun
9/18/2000 9:52:02 AM snorman From: Cai, Qi Chi (Exchange) [mailto:qchicai@bear.com] Sent: Monday, September 18, 2000 9:38 AM To: 'Norman, Stephanie' Cc: Diglio, Jeannette (Exchange) Subject: RE: the nhiDialogRollup problem. Stephanie, The Conversation rollup log file revealed no errors this morning; you can close this ticket now. Please note that I have not replaced the binary rollup file (nhiDialogRollup) with the one you sent to me. In other words, after running the attached script which dropped some tables, both the nhiDialogRollup_new and the original rollup work fine. I don't think Jeannette wants the original binary rollup replaced unless the new one is delivered in the form of an official patch.
Chi
9/12/2000 7:14:21 AM tstachowicz 1) There is no warning that the default rollups are scheduled at 8:00 pm on page 23-17 (user guide). 2) With a fresh install with default scheduled jobs, in terms of a day, the nhSave will run before the rollups because most customers schedule it at 1:00 pm (which is suggested in the user guide), which is before 8:00 pm. 3) This causes a problem: - If the user deletes elements, typically during the day, those elements are put into the nh_deleted_elements table. - If the rollup occurs first, the nh_deleted_element table is purged. - The save occurs and saves the empty nh_deleted_elements table. - The fetch occurs and it does not populate the deleted elements (because there is nothing in the nh_deleted_element table) to the central site. What could be done: - Change the docs to clearly suggest scheduling nhSave later than 8:00 pm. - Make a default rollup job specifically for remote pollers, and point out in the docs that this must be changed for remote pollers.
9/1/2001 3:19:04 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.
9/13/2000 4:36:32 PM snorman It is taking approx 15 minutes to write 1 MB of data. Please see Brad for more details.
9/15/2000 10:57:33 AM yzhang It only takes about 12 seconds to load all the data, and about 1 second to load each table. The problem description says it takes about 15 minutes; I don't know why it takes that long for them. This is a bulk copy, and should be very quick. Attached are the script and output files.
9/18/2000 11:01:27 AM yzhang Brad, The description of this problem indicated that you knew more details about it. I tested loading the data in two ways: if I drop the index, then load the data, then create the index again, the whole process takes only about 1 minute. But if I create the index first, then load the data, it takes about 30 minutes. My suggestion for speeding up the data loading is to drop the index before loading the data.
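The timing finding above (about 1 minute with the index dropped versus about 30 with it in place) can be sketched as an Ingres-style SQL sequence. This is illustrative only: the table name, index columns, data file, and storage structures are placeholders, not the product's real schema or the actual code change.

```sql
-- Placeholder names throughout: convert the target table to an unindexed
-- heap, bulk-load it, then rebuild the unique index once at the end,
-- instead of maintaining the index row by row during the COPY.
MODIFY nh_dlg0_example TO HEAP; \g
COPY TABLE nh_dlg0_example () FROM 'dlg0_sample.dat'; \g
MODIFY nh_dlg0_example TO BTREE UNIQUE ON elem_id, sample_time; \g
```

In the code itself, the same effect is what passing autoReorg as yes to appendTable() after a modify-to-heap would give, per the analysis that follows.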
I don't know if you have any comment on this. Thanks, Yulun
9/20/2000 1:40:54 PM rtrei As we expected, Yulun, this just got escalated.
9/21/2000 1:05:18 PM yzhang The following is a little research I did, based on your advice, regarding where the dlg0 table and index get created, and where the dlg0 table gets loaded. Where the dlg0 table and index are created: CdbTblDlg.sc creates the dlg0 table through createTable(), which calls dbcreateTable in duTable.C. dbcreateTable in duTable.C calls reorgTable() in CdbTblDlg.sc to create the primary and secondary indexes for the dlg0 table. Where the dlg0 table is loaded: dbTblsDlg.C has a function called loadsamplefile(), which calls loadsamplefile() in CdbTblRlpBoundary.sc, which, in turn, calls appendTable() to copy the file into the dlg0 table. The autoReorg parameter passed to appendTable is no; that means no reindexing after loading the data. The way to modify the code is: 1) modify the table to heap if getTableClass returns the dlg0 table; do this before calling appendTable(), and also pass autoReorg as yes to the appendTable function. Thus we drop the index before each load and reindex after each load; hopefully this will speed up the loading. Jay, I want to talk with you about this; tell me when you have time so I can meet you, or you can come to my office any time. Thanks
9/28/2000 10:29:56 AM yzhang Jay, I have made the code change for this problem and have run it through debugging. Robin has quickly reviewed the code change. Can you review it? You can stop by my office anytime. After your review, we can send the executable nhiPoller to PainWeb. Thanks, Yulun
9/29/2000 2:10:44 PM yzhang Stephanie, Can you have the customer grab the executable nhiPoller_painweb from ~ftp/outgoing on ftp.concord.com and replace nhiPoller in $NH_HOME/bin/sys with it, then run the polling from the console? When they are doing the ftp from the site, make sure the number of bytes transferred is correct. Let me know the result. Thanks, Yulun
10/5/2000 3:13:06 PM yzhang The new nhiPoller I built is from nh45 Patch13.
I am wondering if you can have the customer run the new nhiPoller with nh45 Patch13 on Solaris 2.6.
10/6/2000 10:21:00 AM yzhang The permission has been set, and it is now running OK (no error messages coming out of the console). I will watch it for a few more polls. If no errors come out, I will ship the new executable to painweb. Thanks, Yulun
10/6/2000 3:49:59 PM yzhang Bob, The new nhiPoller_nh46 is now in ~ftp/outgoing on our site. Customer can grab and install it. As you mentioned, when the customer installs the new poller, they need to set the ownership and protections properly to prevent error messages. First, it must be owned by root. Second, it has to have the set-uid bit set: chown root nhiPoller chmod +s nhiPoller Thanks, Yulun
10/12/2000 8:54:01 AM yzhang Customer stated that the problem appears to be corrected with the new nhiPoller. The call ticket can be closed as resolved.
10/13/2000 10:31:27 AM rsanginario Thursday, October 12, 2000 8:54:01 AM yzhang Customer stated that the problem appears to be corrected with the new nhiPoller. The call ticket can be closed as resolved. RS: Changing status back to Fixed. This will be addressed at Tribunal today.
10/31/2000 8:55:35 AM rsanginario Target Release should be R-4.6.0 P6 R-4.7.1 P2 R-4.8 R-5.0. The field doesn't allow that many characters, so I'm putting it here.
9/14/2000 11:15:29 AM schapman If a command line load is initiated, a scheduled save can kick off and overwrite the load if it is in the same directory. This has been duplicated in house and reported at a customer site (Belgacom). Related files are on voyagerii\tickets\39000\39424
9/14/2000 3:26:45 PM yzhang Sheldon, I could not reproduce problem 10410. I scheduled a dbsave before doing the dbload in the same directory as the dbsave, but my dbload succeeded, and I did not see the dbsave kick off. Do you think I am missing anything? Thanks, Yulun
9/18/2000 11:52:55 AM yzhang Sheldon, I did the save and load of the db remotely.
First, I logged in to a remote machine, then scheduled a dbsave through the console, then did the dbload through the command line, with the load and save using the same directory. But my dbload succeeded, and I did not see the dbsave kick off. Thus I still could not reproduce the problem. I am wondering if I missed something needed to reproduce it. Thanks, Yulun
9/19/2000 10:38:07 AM yzhang The problem cannot be reproduced at this time.
9/22/2000 9:52:13 AM yzhang A simple test was done to check the system processes for the existence of the other process. Checked nhiLoadDb during a save, and nhiSaveDb during a load. The presence of the other process is not detected.
9/14/2000 1:46:21 PM jpoblete After forcing the nethealth DB consistent, we attempted to save it from the command line, logged in as nhUser (neth), with the command: nhSaveDb -p 0901400.tdb nethealth Right after, we got the message: BUS ERROR and we found a core file in $NH_HOME. Attempted to save to other partitions, but still the same. Reviewed the file /var/adm/messages, and did not find messages regarding a problem with the SCSI bus. The core file is in the call ticket directory: \\voyagerii\Escalated Tickets\39000\39586
9/19/2000 9:40:54 AM rtrei Asked Jose for a status update.
9/20/2000 10:03:11 AM rtrei Yulun-- Some more data for this should be coming in. Can you look at the core file and see where they crashed? My guess is that it is system specific, related to some resources.
10/2/2000 2:37:44 PM yzhang The save.log shown in the ticket directory indicates the save completed successfully. I could not reproduce this problem; my nhSaveDb succeeded.
10/13/2000 2:22:20 PM yzhang Waiting for feedback from the customer.
10/23/2000 4:25:24 PM yzhang The following message in the save log is the source of the problem.
(dbExecSql): sqlCmd: COPY TABLE nh_run_step () INTO 'support1017.tdb/nrt_b23' (dbExecSql): sqlca.sqlcode: -33000 The message sqlca.sqlcode: -33000 means that the nh_run_step table is too big to be written into a 2 GB file. This looks like a very rare case, because the nh_run_step table should not be that big. To confirm the problem, I need the size of the file 'support1017.tdb/nrt_b23'. Can you have the customer get this info by doing ls -l nrt_b23 > size_run_step.out from the command line? The other possibility is that their database has been corrupted; they would need to destroy the current db and reload the original database. Thanks, Yulun
10/23/2000 6:17:23 PM yzhang Have the customer run the following SQL against the current database (the database that has the saving problem): 1) login as nh_user 2) source nethealthrc.csh 3) sql $NH_RDBMS_NAME 4) modify nh_run_step to truncated \g 5) CREATE UNIQUE INDEX NH_RUN_STEP_IDX ON NH_RUN_STEP (JOB_ID, RUN_ID, STEP_SEQUENCE_NMBR) \g At this point, they should have an empty nh_run_step table with the NH_RUN_STEP_IDX index. Then they can run nhSaveDb again. Let me know if this works. Yulun
10/25/2000 11:19:18 AM yzhang One possibility for the hang is that the Ingres transaction log is full. The customer can kill the running sql, then run nhResizeIngresLog 2000 from the command line, then go back to sql mode and execute modify nh_run_step to truncated. If this still does not work, can you have the customer send me the help table info for table nh_run_step via the command line: echo "help table nh_run_step; \g" | sql $NH_RDBMS_NAME > step_table_info.txt Thanks, Yulun
10/25/2000 11:43:16 AM yzhang Can you let me know the status of this problem? If they still cannot truncate the table, I will write a script for them to drop and then recreate the table. Thanks, Yulun
10/25/2000 2:49:30 PM yzhang Customer got another DMT_SHOW error, on the stats0 table. Support has requested the customer drop this table using verifydb, then run nhSaveDb.
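The sqlca.sqlcode -33000 diagnosis above (a table dump outgrowing a 2 GB file) can be checked quickly on the customer side; this sketch flags any file in a save directory at or over 2 GB. The directory path is an example, and the sparse scratch file merely stands in for an oversized table dump:

```shell
# Flag save files at or over the 2 GB per-file limit suspected above.
# The sparse 3 GB file only simulates an oversized dump; it uses almost
# no real disk space.
mkdir -p /tmp/save-demo.tdb
truncate -s 3G /tmp/save-demo.tdb/nrt_b23
find /tmp/save-demo.tdb -type f -size +2G
```

Run against a real save directory, an empty result would rule the 2 GB limit out and point back at the corruption possibility instead.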
2/8/2001 4:48:25 PM yzhang Problem solved, ticket closed.
9/14/2000 3:41:07 PM foconnor Statistic Index failing. The customer ran the cleanStats script, but the script is not fixing the problem; the statistic index job is still failing. errlog.log does show an inconsistent database on July 31; no record in support of fixing an inconsistent database for that timeframe. ----- Job started by Scheduler at '13/09/2000 07:20:20'. ----- ----- $NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (13/09/2000 07:20:20). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Sep 13 01:22:19 2000) ). ----- Scheduled Job ended at '13/09/2000 07:22:19'. ----- cleanStats script output: date nh_stats0_968551199 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Tue Sep 12 09:40:47 2000 continue * Executing . . . (2 rows) (0 rows) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Tue Sep 12 09:40:52 2000 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Tue Sep 12 09:40:52 2000 continue * Executing . . . E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.
(Tue Sep 12 09:40:55 2000) continue * Files can be found at: //voyagerii/tickets/39000/39542/tickets1318.tar directory (errlog.log, Statistics_Index.100005.log, system log, and the cleanStats.sh output) //voyagerii/tickets/39000/39542/nhiIndexDiag.out & ticket39542.tar (database) //voyagerii/tickets/39000/39542/session/session.dat files
9/20/2000 10:04:56 AM rtrei Yulun-- Can you load this database for me, and also load one of the duplicates from nhIndexDiag into a dummy table? Then come get me. I'd like to look at these duplicates, too. Supposedly, they are a new kind.
9/20/2000 1:56:07 PM yzhang Farrell, The tar file in //voyagerii/tickets/39000/39542/nhiIndexDiag.out & ticket39542.tar (database) is not the whole database; it has only the stats data. We need the whole database to diagnose the problem. Let me know when you get it. Thanks
10/3/2000 9:20:09 AM yzhang Please have the customer run the attached script to clean the duplicates in the database. All indexes should be created after running the script.
10/11/2000 9:48:50 AM yzhang The rollup is OK now.
9/25/2000 12:05:21 PM apier nhDiagMonitor should also monitor Ingres processes to make sure the database is available. In certain instances it is possible for the nhiPoller process to continue polling even though the database is not available. One case occurred recently at MCI where the iidbms process died with a 'stack dump'. In this case the only indication that there was a problem were the messages in the console window: -Unable to add 'network element' data to the database, dropping this poll. -Sql Error occured during operation (E_LQ002D Association to the dbms has failed. This session should be disconnected.). -Unable to execute 'MODIFY nh_stats_poll_info TO TRUNCATED'. If the administrator is not watching the machine or the console window is closed, then the condition will go unnoticed.
nhDiagMonitor also does not report this type of failure condition. nhiDiagMonitor should not only monitor critical Network Health processes; it should also monitor critical Ingres processes. In the event of these processes failing, it should generate an appropriate error code.
9/1/2001 3:19:04 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.
9/26/2000 6:14:18 PM rkeville Database has duplicates after running the cleanStats script; indexing fails. - Customer's index fails during rollup; had them run the cleanStats script, it still has duplicates. - From statistics_rollup.log: - Begin processing (09/21/2000 19:00:26). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. The database is located on the FTP server in the incoming directory and is called Sep26ARCsave.tar, and the files are on voyagerii. ####################################################
9/28/2000 10:32:31 AM yzhang It is loading the database now.
9/28/2000 5:04:42 PM yzhang Bob, I am still loading their database; the database is very big. Hopefully the database loading can be completed tomorrow morning; then I can investigate the problem with the loaded database. Yulun
9/29/2000 2:26:28 PM yzhang The database loading is completed, and I have run nhiIndexDiag. I am analyzing the problem now.
10/4/2000 8:56:37 AM yzhang Please have the customer run cleanStats.sh (the one located in Voyagerii/Scripts). First login as nhuser, cd to NH_HOME, then source nethealthrc.csh. Then run the following command: cleanStats.sh clean With clean as the argument, cleanStats.sh makes sure the duplicates are deleted; running cleanStats.sh without the clean argument only reports the duplicates in the database. Run nhiRollupDb after running cleanStats.sh clean.
Thanks, Yulun
10/4/2000 3:29:02 PM yzhang The db is rolling up successfully now.
10/5/2000 4:27:40 PM jpoblete Customer experienced problems with a stats0 table due to the non-recoverable DMT_SHOW error. We attempted to save the db with the nhSaveDb command. The log shows that after finding the offending table it stops: begin processing (10/05/2000 12:34:45 PM). Copying relevant files (10/05/2000 12:34:48 PM). Unloading the data into the files, in directory: '/opt/concord/bin/support1005.tdb/'. . . Unloading table nh_daily_exceptions . . . Unloading table nh_daily_health . . . Unloading table nh_daily_symbol . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table ex_thumbnail . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_hourly_health . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . . Unloading table nh_system_log . . . Unloading table nh_step . . . Unloading table nh_schema_version . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_protocol . . . Unloading table nh_protocol_type . . . Unloading table nh_rpt_config . . .
Unloading table nh_rlp_plan . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_analysis . . . Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . Unloading table nh_var_units . . . Unloading the sample data . . . Fatal Internal Error: Unable to execute: 'COPY TABLE nh_stats0_965285999 () INTO '/opt/concord/bin/support1005.tdb/nh_stats0_965285999'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Thu Oct 5 13:38:07 2000)). (cdb/DuTable::saveTable) The db save stopped at the table nh_stats0_965285999 and did not continue to save the other stats0 tables, resulting in an invalid DB save. Spoke to Mike Anthony, Robin Trei and Don Gray about this; they agreed to mark this as a BUG, since the DB save should continue saving the db when it finds a table with the DMT_SHOW error.
10/13/2000 12:32:50 PM rtrei Yulun-- This is one we would like to get out for 4.8 and patch all the releases. We need to talk with the Support people and PM about whether we want a flag or not.
11/13/2000 10:15:16 AM yzhang Trying to capture the error code number.
1/11/2001 11:35:43 AM yzhang Changed to assigned.
1/16/2001 8:43:56 PM yzhang After you delete the stats0 row from nh_rlp_boundary, you can do a save db again. Sometimes it works this way. If you have the database backup, keep it, because I want to use it for the permanent DMT_SHOW error fix. If you do not have a backup, I want to work with you sometime tomorrow to save it. Thanks, Yulun
3/1/2001 8:10:41 PM rtrei I recommend this go out in 5.0. 3 days to fix and unit test.
4/20/2001 11:31:45 AM pkuehne Changed 'Assigned Priority' to "High"
8/27/2001 8:44:51 AM smcafee Changed 5.0 Mediums and Lows pre-Aug 22nd to Archived
10/5/2001 8:17:58 AM dbrooks Closed - see above comments.
10/23/2000 12:20:14 PM foconnor I have a customer who is having some conversation rollup issues, and we are not able to enable jobs in the scheduler.
We can enable jobs, but several minutes later they show up as disabled. For example, I ran the command nhSchedule -enable for jobs 10000, 100002, 10003 and 100004 successfully, but when I got to 100005 I get this error:

pinkpanther% nhSchedule -enable 100005
Fatal Internal Error: Unable to receive message from another process - you may need to restart the Network Health server (Connection reset by peer). (ccm/ccmGetPacket (pkt header))

And then all the jobs show up as disabled again (except for one MyHealth). This server is also experiencing issues with conversation rollups. Files: //voyagerii/tickets/39000/39849/Oct_23

10/23/2000 1:53:21 PM mikep Check for core files; sounds like nhiDbServer is crashing? 100005 is nhiIndexStats, which can only be disabled via command-line nhSchedule; you could disable the others via the GUI and see if that works. If no core file is produced, set advanced logging on for the dbServer and send me the log after you execute nhSchedule. Thanks, Mike P.

10/23/2000 2:39:02 PM foconnor To clarify a point: in order for us to run command-line dialog rollups, I had disabled the conversation rollup entry via the command line on Friday. On Monday the customer complained that all the scheduled entries were disabled. It is unknown how the rest of the entries became disabled. Upon trying to re-enable the scheduled entries I get errors, a core file, and all the entries except one MyHealth report get disabled again.

10/30/2000 2:06:23 PM don Provided info.

10/30/2000 2:35:21 PM mikep I logged into the customer site, grabbed the contents of the job schedule tables, and loaded them on my system. I was able to enable and disable them with no problems. Farrell informed me that the dialog rollups were running a long time without ending. I looked at the Ingres error logs and there are numerous cases of forced query abort. Let's get their database in here and see if we can reproduce any of these problems. Mike P.
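The by-hand enable sequence above can be wrapped so the first failing job ID is obvious. A minimal sketch, assuming only the nhSchedule CLI quoted in the log; the actual call is left commented out since it needs a live Network Health server.

```shell
# Sketch of re-enabling scheduler jobs one at a time, stopping at the first
# failure so the failing job ID (100005 in this ticket) stands out.
enable_jobs() {
    for id in "$@"; do
        echo "enabling job $id"
        # Uncomment on a live Network Health server:
        # nhSchedule -enable "$id" || { echo "failed at job $id" >&2; return 1; }
    done
}

enable_jobs 10000 100002 10003 100004 100005
```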
11/1/2000 11:40:59 AM foconnor Told Mike where the database was saved on Mobilx's system.

11/1/2000 2:41:04 PM mikep Got their database and loaded it here; received a bunch of errors on startup about invalid job steps. Found that all step_types in the nh_job_step table were set to 100, which is incorrect. I shut down the servers and fixed the step types, and also added the step type for nhiIndexStats, which was missing. I restarted the servers, and nhiIndexStats and nhiDlgRollups ran successfully. Will close this in a few days if the problems are all gone. Don't know how the nh_job_step table was corrupted; the system had severe problems with Conversations Rollups and 1.8 million nodes, so perhaps memory problems caused it. Mike P.

11/2/2000 4:16:03 PM mikep Customer is up and scheduled jobs are working.

10/27/2000 11:33:06 AM foconnor Customer's first conversation rollup of each day was running so long that the customer would have to reset the server so that other jobs could run. We have attempted to reduce the unreferenced nodes from 1.5 million to something smaller by setting the NH_UNREF_NODE_LIMIT variable from 50 to 40 to 30 to 20, but running nhiDialogRollup from the command line has only reduced the node count to 1.2 million nodes. We have been attempting to run the rollups from the command line for a week, but the rollups run continuously and we do not seem to be progressing. I attempted to run a rollup with -Dall set, but the log file created was starting to fill up the customer's tmp directory (3.8 GB). Customer has disabled all rollups so that we could attempt to get the number of nodes down. Sun Solaris 2.8, 4x450 CPU, 4 GB mem,
6 GB swap, 2 GB transaction log.

Location Name | Free Space | Path
ii_database | 25911557000.00 bytes | /opt/concord/idb

Statistics Data:
Number of Elements: 241
Database Size: 55148544.00 bytes
Location(s): ii_database
Latest Entry: 27/10/2000 17:26:11
Earliest Entry: 30/08/2000 00:00:00
Last Roll up:

Conversations Data:
Number of Probes: 11
Number of Nodes: 1221807

As Polled:
Database Size: 35143680.00 bytes
Location(s): ii_database
Latest Entry: 25/10/2000 15:00:00
Earliest Entry: 24/10/2000 12:00:00
Last Roll up:

Rolled up Conversations:
Database Size: 1225056256.00 bytes
Location(s): ii_database
Latest Entry: 24/10/2000 11:59:59
Earliest Entry: 10/09/2000 00:36:11

Rolled up Top Conversations:
Database Size: 789716992.00 bytes
Location(s): ii_database
Latest Entry: 24/10/2000 11:59:59
Earliest Entry: 30/08/2000 12:20:55

Files: //voyagerii/tickets/39000/39849 (current files can be obtained via telnet)

10/30/2000 4:30:59 PM yzhang Downloading the database now, then will debug the conversation rollup.

10/30/2000 5:41:21 PM yzhang Farrell, the customer database has 1.6 M rows in the nh_element table, and their NH_POLL_DLG_BPM is 1500. Can you have the customer run the cleanDlg and then cleanNodes scripts (they are located in the escalation directory), followed by running nhiDialogRollup. Thanks, Yulun

10/31/2000 10:50:59 AM yzhang Farrell, Robin and I looked at both scripts; they look OK, and you can have the customer run them. Before running the scripts, the customer should create a temp directory with sufficient disk space. Thanks, Yulun

11/2/2000 9:41:17 AM yzhang Ticket closed.

11/1/2000 5:14:42 PM mpoller Right now the only tool we have for checking on the db status is the nhDbStatus command, which only reports on db properties. It is too general a command for this customer.
They would like a new command that would analyze the structure of the db to find any inconsistencies, corruption or other possible faults. Note: request received from Colin Kopp, nethealth@net.gov.bc.ca, (250) 387-8051. Reference ticket 41491.

9/1/2001 3:19:06 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.

11/1/2000 5:19:54 PM mpoller Right now, when an nhDbLoad is run, the load.log file is created. Once this log file is created it remains at 0k until the load completes or until a buffer is written out to the load.log file. What this customer would like is for the load log to be written to stdout as the load runs, or for the output to load.log not to be buffered. This way they can view the load log output as it is written and can see any errors as they happen. As well, it gives the customer a warm feeling that something is happening; large database loads can take quite a while to complete. Note: request received from Colin Kopp, nethealth@net.gov.bc.ca, (250) 387-8051. Reference ticket 41491.

9/1/2001 3:19:06 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.

11/2/2000 5:04:04 PM snorman This has NT SP5 and the latest patches for Network Health. I have received the advanced logs and they are in the escalated tickets directory. Since there really isn't any way to change the stack, I need some assistance.

11/6/2000 12:58:28 PM rtrei Yulun: I think the database worksheets discuss how to set the stack for an NT system. (If they don't, then we need to update them.) This is an easy job, but Support cannot do it; you need to have C++ installed. Can you get them an nhiDialogRollup.exe from the correct release build, and increase its stack? (If the instructions aren't in the worksheets, I'll show you what you need to do.) This has been sitting since Thursday, so we need to move on it ASAP.
thanks, Robin

11/6/2000 4:00:58 PM yzhang Stephanie, the new nhiDialogRollup_11367.exe with a larger stack size is now located in ftp/outgoing on ftp.concord.com. You can have the customer run this new nhiDialogRollup. If they still have a stack oversize problem I can double the size later. Robin, thanks for your help. Yulun

11/7/2000 9:58:28 AM don This did not work!

11/9/2000 9:21:37 AM don Escalated for customer sensitivity.

11/9/2000 10:55:08 AM yzhang Bob, this ticket was escalated this morning. I sent a new nhiDialogRollup.exe with a larger stack size last week; I want to know what message the customer gets this time with the new exe I sent. Thanks, Yulun

11/13/2000 10:29:08 AM yzhang Fixed in 471P2.

11/13/2000 10:32:13 AM yzhang This is still more info; the update with "fixed in 471P2" was a mistake.

11/16/2000 6:59:53 PM rkeville Customer has been contacted a number of times with no response.

11/17/2000 1:06:48 PM rkeville Three-strike letter sent.

2/2/2001 3:13:07 PM cestep Have a new customer with the same error. Got a screen shot of the error, found on Voyagerii. The customer is getting a Stack Overflow error from nhiDialogRollup.exe.

2/6/2001 3:51:41 PM yzhang Can I have CM build nhiDialogRollup.exe for 471 P1 on an NT system? It would be great if I could have this executable tomorrow. Thanks, Yulun Zhang

2/8/2001 9:37:57 AM yzhang Jose, this executable with a stack size of 8388608 is for prob. 11367. Have the customer back up the original nhiDialogRollup.exe, then run the conversation rollup with the attached.

2/11/2001 5:43:04 PM yzhang Update to moreinfo.

2/13/2001 9:42:34 AM jpoblete Here's the response from the customer.

-----Original Message-----
From: Gottlieb, Matt [mailto:MGottlieb@ikon.com]
Sent: Monday, February 12, 2001 9:05 PM
To: Poblete, Jose
Subject: RE: Concord Call Ticket 44984 - nhiDialogRollup.exe

Sorry for not getting back to you last week, and I was out of the office today. It's been a crazy couple of days. Yes, the fix seems to be working.
John Witte (our local Concord SE) was at our office last week; we found a couple of problems and we installed the file you sent me. The database had not rolled up since last August and had grown to over 10 gig. After a little coercion it finally rolled up; it is considerably smaller now, and the server is no longer Dr. Watsoning. I'm keeping an eye on it and will let you know if the problem returns. Thanks for the help. -Matt Gottlieb

2/13/2001 5:29:15 PM pkuehne Closed pending more information. Ticket will be reopened if more information is received. Peggy Anne Kuehne

11/3/2000 3:09:04 PM rkeville Unresolved deadlock causing the nethealth database to be marked inconsistent.

- Thu Oct 26 17:07:44 2000 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 39 for table iirelation in database nethealth with mode 5. Resource held by session [29606 3d4].
- Please reference the errlog.log file in the escalated tickets directory for Thursday, Oct 26; this is where the fun began.
- The customer performed an nhForceDb and recovery; they are back up and running at this time. Additional requested information is located in the escalated tickets directory.

#####################################################

11/6/2000 12:46:24 PM rtrei I asked Support to create a ticket for this, as I want to report it to CA and track it. This inconsistency happened on 4.7.1, which includes the latest 2.0 patch. Therefore I want CA to look at the files and see if they see anything. Beyond sending the data to CA, there is little additional work I will need to do on this one.

11/8/2000 11:03:21 AM rtrei Your issue was created successfully. Please note the following: Product: INGSRV; Open Date: Nov 8, 2000; Contact #: 10445028/1; Company: 167267 - CONCORD COMMUNICATIONS INC. Please refer to this contact number in any communication regarding this issue with CA Technical Support. Your issue was assigned to group: INGRES ISL S 1. This ticket should be de-escalated.
There is nothing immediate that will happen. CA will look at the logs and possibly make an update to an II patch. (2.0 will not be updated.)

11/27/2000 4:29:37 PM don We need to escalate this in CA; the customer wants reassurance before going to production with this system.

11/28/2000 5:13:44 PM rtrei Re-pinged CA and asked that this be set to a priority 1. In all honesty, they have been concentrating on 4.8 priority 1 issues. However, I will check with them tomorrow for status.

12/13/2000 9:25:51 AM rtrei Have asked for a list of table sizes from the customer to check overflow pages.

12/18/2000 11:14:16 AM rkeville
-----Original Message-----
From: Keville, Bob
Sent: Monday, December 18, 2000 11:04 AM
To: Trei, Robin
Cc: Gray, Don
Subject: RE: 11394
Robin, he doesn't have it; is this the one? (see attached) Thanks, -Bob

12/21/2000 2:39:11 PM rtrei Bob: the sysmod did remove all the overflow pages in the iirelation table. CA's current theory is that the deadlock in iirelation was caused because there hadn't been a sysmod and there were too many overflow pages. The inconsistency was obviously caused by the deadlock, so removing the cause of the deadlock should prevent this type of inconsistency from happening. I'm not sure how much more the customer wants on this one. I would recommend that they regularly do a sysmod, say once every week or two. However, they must be sure they are shutting Ingres down properly (via nhStopDb) or shutting the system down properly. I expect to put some type of patch or advisory out on this, but want to investigate and test this thoroughly first. How do you want to proceed?

12/28/2000 10:21:17 AM rkeville The sysmod of the database seems to have worked; let's close this one out.

3/1/2001 8:11:42 PM rtrei Marking closed. We are recommending frequent sysmods.

11/7/2000 4:18:27 PM shagar Have the ability to delete Ingres users. We currently have the ability to add users with nhAddDbUser, but no way to delete them.
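Robin's recommendation above (a sysmod every week or two, with Ingres shut down cleanly first) can be captured in a small wrapper. A hedged sketch: sysmod is the standard Ingres catalog-maintenance utility and nhStopDb is quoted in the ticket, but nhStartDb is only assumed here as its counterpart, and the commands are printed rather than executed so the sequence can be reviewed.

```shell
# Sketch of the periodic-sysmod advice from this ticket. Commands are
# emitted, not run; nhStartDb is an assumed counterpart to the nhStopDb
# command named in the log.
weekly_sysmod() {
    db="$1"
    printf 'nhStopDb\nsysmod %s\nnhStartDb\n' "$db"
}

weekly_sysmod nethealth
# Example crontab entry (illustrative path), Sunday 02:00:
#   0 2 * * 0 /opt/nethealth/scripts/weekly_sysmod.sh
```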
Sambit Nanda, DSL Net, 203-782-3968.

9/1/2001 3:19:06 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.

11/9/2000 7:33:06 PM rkeville Upgrade appears to fail during database conversion.

- Customer attempted to upgrade from NH 4.6 to NH 4.7.1; the install log received the following message:

Converting database nethealth
Non-Fatal database error: Failed to create index on nh_stats0_965753999 06-Nov-2000 05:53:05
Database error: -33000, E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Nov 6 05:53:04 2000)

- Multiple tables in the install log are referenced.
- Rollups had been failing since at least August 8, 2000.
- Dropped the tables with a drop-table script and removed the references to the ranges in the nh_rlp_boundary table.
- Converted the database manually.
- Rolled up, saved, reinstalled and loaded the saved db. They are back up and running now, but the upgrade should not fail to convert the database.
- Solaris 2.6, NH 4.7.1.
- Files located on voyagerii.

########################################################

11/13/2000 11:40:30 AM rtrei A non-fatal database error is what the customer should have gotten. Non-fatal means that the conversion continued, but that there was a problem that needed to be fixed. It sounds like that is exactly what happened.

11/17/2000 12:49:11 PM wburke Statistics rollup failure due to an append to stats1. Ran cleanStats, no effect; nhiRollupDb failed again. We attempted to drop the table, but with no success; ran command-line rollups, which failed. I had him run the cleanStats.sh clean and it returned nothing; had them run nhiRollupDb and it failed with an error.

11/17/2000 12:49:50 PM wburke Db on voyagerii/42000/42029

11/20/2000 11:43:16 AM yzhang The testing on nhiRollupDb is still going on on system zeppelin, but it looks OK.
Have the customer run the attached two scripts: log in as the nh user, then source nethealthrc.csh. Then type fix11615.sh from the command prompt. After this, do ./stats1Dup.sh nh_stats1_969890399. Run nhiRollupDb after running these two scripts.

11/27/2000 1:20:35 PM yzhang Looking at the log file you attached, I think they ran stats1Dup.sh without an argument. The correct way to run it is: ./stats1Dup.sh nh_stats1_969890399. Have the customer issue the above command after logging in as the nh user and sourcing net*.csh. Thanks

11/28/2000 12:34:42 PM wburke
-----Original Message-----
From: Meacle, Michael A [mailto:Michael.A.Meacle@team.telstra.com]
Sent: Monday, November 27, 2000 10:27 PM
To: Burke, Walter
Cc: Chew, Beng
Subject: RE: ticket # 42029

Walter, good news: all appears OK. The rollup took ~2 hours and completed without any problems. I will monitor it over the next few days. As far as I'm concerned this fault (42029) is resolved; please feel free to close it whenever convenient. Thanks for your very helpful assistance. regards, Mick Meacle, Internetworking Specialist, Enhanced Business Services, Internetworking Solutions. Phone: (07) 38876013, Mobile: 0417 716 562. p.s. now for some sleep

11/17/2000 6:42:06 PM rkeville Repeated QEF errors in the errlog.log file; rollups failing constantly.

- NH 4.6 P03, Solaris 2.6.
- They start receiving QEF errors and the rollups start to fail afterwards.
- Initially they had low settings for the system resources; I recommended the following, with no effect:
  set shmsys:shminfo_shmmax=50000000
  set shmsys:shminfo_shmmin=600
  set shmsys:shminfo_shmseg=600

Logs are in the escalated ticket dir on voyagerii.

###############################################

11/21/2000 10:20:41 AM yzhang Please have the customer run the attached script to clean the duplicates in the database.
All indexes should be created after running the script.

11/22/2000 2:31:48 PM yzhang Asked the customer to double the qef_sort_memory size to take care of the QEF error, then run nhiRollupDb.

12/5/2000 2:55:46 PM rkeville Walked the customer through the procedure; we will wait until Wednesday to see if this is resolved.

###############################################

12/6/2000 5:12:44 PM rkeville They are still receiving QEF error messages in the errlog.log file.

###############################################

12/7/2000 3:02:44 PM yzhang The customer has now done the rollup successfully, but there is still the same QEF error in the errlog.log file after doubling the qef_sort_memory size. I looked at the ticket you submitted to CA; it looks like they suggested doubling rdf_memory also. Do you think I can have the customer double rdf_memory?

12/7/2000 4:05:35 PM yzhang Bob, you can have the customer double the rdf_memory size in the following way: CBF --> DBMS Server --> Configure --> Derived --> rdf_memory. Thanks, Yulun

12/8/2000 11:53:56 AM rkeville Walked the customer through the new Ingres config.

12/18/2000 11:32:59 AM rkeville Spoke to the customer; there are only a few QEF errors in the errlog.log file and the rollups are running successfully now. Close ticket.

##################################################

12/20/2000 3:44:36 PM yzhang The ticket is closed.

11/22/2000 9:21:23 AM rkeville Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys.)

- Database is located on voyagerii.
- Unable to find the table with the duplicate key; it looks like it is from Sept 11. I deleted all the tables for this date and the rollup still fails.
- Ran "select _date(min(sample_time)), _date(max(sample_time)) from nh_stats0xxx" on some of the stats tables, but I was unable to find the problem table.

###################################################

11/27/2000 11:01:18 AM yzhang The problem table is the last stats0 table.
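The table-by-table hunt described above (running the min/max sample_time query against individual stats0 tables) can be batched: generate the query for each candidate table and pipe the batch to the Ingres terminal monitor. An editor's sketch; the table names below are illustrative and the `sql nethealth` invocation is left commented since it needs a live database.

```shell
# Sketch of the duplicate-key hunt from this ticket: emit the min/max
# sample_time query (taken verbatim from the log) for each stats0 table
# name supplied, so the output can be reviewed or piped to sql.
range_sql() {
    for t in "$@"; do
        printf "select _date(min(sample_time)), _date(max(sample_time)) from %s;\n" "$t"
    done
}

range_sql nh_stats0_965285999 nh_stats0_975333599   # illustrative names
# range_sql <list of stats0 tables> | sql nethealth
```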
Can you have the customer run the attached nhCleanDupStats.sh script, just by typing the script name after logging in as the nh user and sourcing net*.csh. Thanks, Yulun

12/1/2000 1:06:52 PM rkeville Issue is resolved; close ticket.

12/4/2000 10:13:53 AM yzhang The script solved the problem.

11/27/2000 4:14:41 PM xzhang element_type 351 is not in the nh_element_type table, but I was able to find some elements of this type. When I run a health report, I get the following error:

Fatal Error: Assertion for 'Bad' failed, exiting (Element type 351 is not defined in nh_element_type in file ../CdbCacheElementTypes.C, line 274). Report failed.

12/1/2000 11:53:54 AM jay This is a certification issue. Element type 351 was deprecated and changed to element type 3000 during the 4.7 release. During the same development time frame as 4.7, certification created two new mtfs using that type. When the mtfs were merged into 4.7, they hadn't been updated. Specifically, the following MTFs need to be updated. I am leaving this with certification, as they may want to apply this to the 4.7 cert release as well, and they don't want to create 351 MTFs any more:

hp_managewise_partition.mtf: mediaType = -351 # sysPartition
quotron-aix-system-partition.mtf: mediaType = -351 # SYSTEM partition (storage)

12/4/2000 10:48:51 AM nsaparoff Chau changed the element types of the affected mtfs.

11/28/2000 11:53:58 AM foconnor Customer cannot run reports or perform a database save.
Here is what happened, step by step, from the reseller:

Sat 18 Nov: customer upgraded to 4.7.1. One error message occurred here; see ticket 42465. Customer installed P1; subsequent error messages "Expectation for Bad failed..." (fix known, but not yet installed) and error messages during the install, see ticket 42466. Customer deinstalled P1 and installed D02.

Wed 22 Nov: customer calls, NH down; see the error messages in this ticket description. I asked them to look at one of the aaaa* files; 'more' crashed with a core dump on this file, so we didn't test further here. Shut down nhServer and the DB, restarted the DB and NH; seemed OK. DMT_Show error message in the errlog; I suggested destroying/reloading the DB. No more error messages in the errlog until Saturday.

Sat 25 Nov: customer saved/destroyed/created the DB but failed to reload the saved DB. Customer loaded the last good DB from Tuesday; seemed to work. In between, she did a fscheck with no results.

Mon 27: again error messages (no errors on Sunday). Customer says the DB save was OK (we don't have the log). Jobs didn't run.

Tue 28: same status; the DB now fails again.

Error from attempting to run a scheduled trend report: Error: Sql Error occured during operation (E_QE0080 Error trying to position a table. (Mon Nov 27 21:14:24 2000)). Report failed.

Appending 50 lines of the Ingres errlog.log file:

SLSANH ::[55774 , 40f5d3e0]: Mon Nov 27 21:14:23 2000 E_CL060F_DI_EXCEED_LIMIT File resource quota exceeded
SLSANH ::[55774 , 40f5d3e0]: Mon Nov 27 21:14:23 2000 E_DM9005_BAD_FILE_READ Disk file read error on database:nethealth table:nh_stats0_975333599 pathname:/database/idb/ingres/data/default/nethealth filename:aaaaamgn.t00 page:93

Database save fails: Fatal Internal Error: Unable to execute 'COPY TABLE nh_hourly_volume () INTO '/opt/nethealth/db/save/save02.tdb/hrv_b45'' (E_SC0206 An internal error prevents further processing of this query.
Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Mon Nov 27 23:01:30 2000)). (cdb/DuTable::saveTable)

DataAnalysis is failing:

Job started by Scheduler at '11/28/2000 00:10:44'.
----- ----- $NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME -----
Begin processing (11/28/2000 00:10:45).
Error: Sql Error occured during operation (E_LQ0058 Cursor 'nh_stats0_974084399QC' not open for 'close' command.).
Error: Sql Error occured during operation (E_LQ0058 Cursor 'nh_stats0_974084399QC' not open for 'close' command.).
----- Scheduled Job ended at '11/28/2000 00:15:18'. -----

Files: //voyagerii/tickets/42000/42641.....

12/7/2000 8:44:21 AM yzhang Loading the customer database.

12/8/2000 4:34:38 PM yzhang Just to let you know that the database loading is fine. I am running the db save, data analysis and trend report now to see if I have the same problem the customer has.

12/8/2000 5:13:02 PM yzhang This is the update on this problem. I have run nhLoadDb, nhSaveDb and nhiDataAnalysis with the customer's database. All three executions were perfect; there is not even a single error message in the errlog.log. (I have not tested the trend report yet, because I am waiting on license data.) Thus I agree with Robin, who mentioned earlier that this customer's problem is indicative of a disk problem more than just a database problem. They need to make sure their hardware and disk system are properly configured. Or they may want to try the same thing on a better or newer HP system if they have one. Yulun

12/11/2000 2:17:09 PM yzhang I tested the trend report manually and it works fine. I also scheduled a trend report and did not see a failed status.

12/11/2000 5:21:55 PM yzhang Sheldon: I told you that my testing with the customer's database on nhLoadDb, nhSaveDb and nhiDataAnalysis was perfect, and the test on the scheduled trend report was also good. So I cannot see anything wrong.
It is most likely that their problems are hardware related. Robin suggested checking whether the page allocation and the overflow pages on the problem tables may be causing the trouble. The attached script will collect that information. Have the customer run the attached script by just typing the script name from the command prompt after logging in as nh_user and sourcing nethealthrc.csh, then send me the concord.out file located under $NH_HOME/tmp.

It's normal that the count of open Ingres files will change while performing different tasks. But I personally think the number of open Ingres files should not cause their problem if the nethealth installation is successful.

Robin, the following are the customer's questions. I don't think I can give a good answer; so far I still don't know how the limit of open Ingres files is defined. I think this depends on the Unix kernel parameter settings.

Here is the most recent listing of open files of the ingres users, and a system log to show the times when certain jobs are running. There are some things to notice: the count goes from 2900 down to 180 at the time of the Maintenance Job (server restart). It increases to 1300 at the DataAnalysis (after midnight). More increases occur when Health Reports are started around 3:00 AM and when the database is saved (5 AM). In the end the count of open Ingres files is around 2700 again. At that level, the number is constantly increasing. Question: is this the normal behaviour of the number of open files belonging to Ingres? Is the absolute value in a normal range? How do these numbers relate to the `Maximum number of open files' kernel parameter (which is set to 1024 in this environment)?

12/13/2000 10:49:06 AM yzhang Robin, attached is the information from iitables for the problem tables the customer has. It looks like the allocated space is enough for the number of rows the tables have. Basically, the customer is very serious about the number of open Ingres files.
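The open-file trend the customer tracked by hand can be sampled on an interval to produce the kind of timestamped series discussed in this ticket. An editor's sketch; it assumes lsof is already installed, and the logging loop is left commented since it runs forever.

```shell
# Sketch of periodic sampling of the open-file count for a given user.
count_open_files() {
    # lsof prints a header line, so this slightly overcounts; close enough
    # for trending. Errors (e.g. lsof not installed) are suppressed.
    lsof -u "$1" 2>/dev/null | wc -l
}

count_open_files ingres
# Hourly series for trending (illustrative output path):
# while true; do
#     echo "$(date '+%Y-%m-%d %H:%M:%S') $(count_open_files ingres)"
#     sleep 3600
# done >> /var/tmp/ingres_open_files.log
```

Graphing the series makes the sawtooth the customer described (drop at the nightly restart, climb through DataAnalysis, reports and the db save) immediately visible.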
12/13/2000 4:52:40 PM yzhang Robin talked to me this morning; she thinks the customer's concern about the count of open Ingres files is reasonable. Again, we believe that is something specifically related to the customer's Unix environment, and it does not have very much to do with the database. Can you find a Unix expert who is good with the Unix file system and disk quota problems, and have that person look at this problem? Thanks, Yulun

12/14/2000 3:21:11 PM don Customer has all the file settings correct from a UNIX standpoint. I looked at the ticket and saw it was a files-open issue from the OS. The customer is saying that the number of open files owned by ingres increases daily by about 100 files until it reaches about 2800-3000, at which point they reset Ingres to keep the issue from happening. Can we check and see why this might happen? Can we check with CA and see if this is normal? Can we find out how many files Ingres should open?

12/14/2000 3:22:54 PM don Is it normal for the number of files opened by ingres to increase by about 100 each day? Is 2800-3000 open files by ingres normal? If Ingres does not get reset, will a crash occur again? If this behavior is not normal for Ingres, what changes in the operating system, hardware or software can be made to correct the issue?

12/15/2000 3:11:42 PM yzhang Have opened two tickets with CA regarding the checksum error and too many open Ingres files:

One of our customers has had a series of database issues lately and has become extremely sensitive to errors they see in the errlog.log files. This error (below) occurs about 20 times in a row. Now my question is: what causes "Page Checksum failure" messages in the errlog.log file? Does the query on the table keep going until it is successful? If the query on the table is unsuccessful, how many times will it repeat before it fails?
From the errlog.log file:

SLSANH ::[49168 , 40ac2040]: Tue Dec 5 14:01:14 2000 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats0_973508399, Page 10.

One customer noticed that the number of open files owned by ingres increases daily by about 100 files until it reaches about 2800-3000, at which point Ingres gets reset. They are very sensitive to the number of open files increasing, because Ingres crashed with the error "Too many open files" a few weeks ago! Excerpt from the errlog.log file:

SLSANH ::[53031 , 416b2ae0]: Tue Nov 21 19:03:46 2000 E_CL060F_DI_EXCEED_LIMIT File resource quota exceeded
DIlru: All files in the LRU are pinned
open() failed with operating system error 24 (Too many open files)

Even though the database checked out well by me, the remaining questions yet to be answered are: Is it normal for the number of files opened by ingres to increase by about 100 each day? Is 2800-3000 open files by ingres normal? If Ingres does not get reset, will a crash occur again? If this behavior is not normal for Ingres, what changes in the operating system, hardware or software can be made to correct the issue?

For your information, they use the lsof utility to check the number of open Ingres files, as instructed below. This is the place to download lsof: the latest release of lsof is always available via anonymous ftp from vic.cc.purdue.edu. Look in pub/tools/unix/lsof.

Quick installation: basically, all you have to do is compile lsof:
$ cd
$ ./Configure (this will guide you through some questions, where I always accepted the default)
$ make

There will be a single binary, lsof, which you can place wherever seems appropriate in your environment. There is a man page, which will take you about two days to read through, but there is also a 00QUICKSTART. The customer uses the command lsof -u ingres | wc -l to get the number of files open by user ingres.
I will send you the latest output of this command from the customer as soon as I have it.

12/15/2000 3:57:01 PM yzhang CA responded to me on the checksum failure problem: CA thinks it is the hardware and a bad disk that cause the checksum failure. They mentioned that the checksum error appears every time a table residing on the bad disk is accessed. That is why the customer got the constant checksum errors. Yulun

12/18/2000 9:07:46 AM foconnor I called Jan; the customer had another checksum error.

12/18/2000 1:22:02 PM yzhang I posted the problem concerning the number of open Ingres files to CA. They need the following information to answer the question. Farrell, please have the customer send the following: 1. Exact version of Ingres/OpenIngres the client is running. 2. Ingres patch number, if any are applied. 3. OS vendor and version. 4. Output of ulimit -aH or ulimit -a from the OS prompt. 5. Output of syscheck -v from the OS prompt.

12/18/2000 4:28:57 PM yzhang James, I am trying to telnet to alcanet by doing telnet 149.204.45.43, but I get a message "unable to connect to remote host". Can you help me with this? Thanks

12/21/2000 2:45:39 PM rtrei I've logged this with CA as a priority 1. Dialed in and got a lot of data, which I've uploaded to them. Prepared a workaround to uninstall the ingres script. Don G./Farrell have the information. Customer may postpone until after the holidays.

1/10/2001 9:35:09 AM yzhang Farrell, the patch installation completed successfully in our test. I guess the customer may not have used bin when doing ftp. Can you ask the customer to retry the ftp with bin set for the .TAR file and ascii set for the document and script files, and make sure that they follow the directions described in the document file for installing the patch. Yulun

1/11/2001 10:59:59 AM yzhang Farrell, tell the customer that I did the patch install on HP11 with a 64-bit machine on nh471,
and the patch install was successful; there is no checksum error. Also, can you have the customer monitor the number of open ingres files as they did before on their new nh471 install. Do the monitoring with ingres running. Send us the monitoring results, and also send the config.dat that is located under $NH_HOME/idb/ingres/files. Thanks Yulun 1/11/2001 11:53:10 AM foconnor requested config.dat and monitoring results from ICS 1/15/2001 4:00:00 PM wburke Output of ulimit -aH or ulimit -a from cmd prompt. Output of syscheck -v from cmd prompt. Type of hard drive requested. 1/22/2001 2:34:54 PM yzhang Have logged a ticket to CA again as follows: We have a customer who got a checksum error and a faulting-a-group-of-pages error showing in errlog.log as follows: Thu Jan 11 04:14:37 2001 E_CL2530_CS_PARAM default_page_size = 2048 slsanh ::[ingres , 00006377]: Thu Jan 11 04:14:37 2001 E_CL2530_CS_PARAM sec_label_cache = 100 SLSANH ::[59501 , 400e0a30]: Thu Jan 11 04:14:37 2001 E_SC0129_SERVER_UP Ingres Release OI 2.0/9712 (hp8.us5/00) Server -- Normal Startup. SLSANH ::[59501 , 40a265a0]: Thu Jan 18 23:06:40 2001 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats1_976316399, Page 8. SLSANH ::[59501 , 40a265a0]: Thu Jan 18 23:06:40 2001 E_DM920D_BM_BAD_GROUP_FAULTPAGE Error faulting a group of pages and this morning, the same customer reported that they have a checksum error on a table which has been dropped already (that is, the table no longer exists in the database, but there is still a checksum error reported for the table). Can you help us with the following questions: 1) The customer has done hardware and disk checks; they look OK. We want to know again exactly what else causes the checksum error and faultpage error, and how to correct the problem. 2) The checksum error seems to be happening only on our older tables. We read the data in these files, condense the data and then drop the old tables. 
Why is there still a checksum error for a table which has been dropped? 3) Can you recommend what kind of debugging we can use to get better insight into what is going on and why? We have standard data (logs, config.dat, etc.) which we will upload if you want to look at it. 4) Do you want to look at anything else? 5) We reported the checksum error to you in the past, but I hope you can reinvestigate this problem so that we have some more useful information for our customer. Thanks Yulun 1/26/2001 5:48:23 PM yzhang Just to let you know that I ran nhiRollupDb with the customer's new database (customer saved it on 1/23/2001). The rollup completed successfully in about 15 minutes without any error message. The rollup was run on HP11 64-bit, NetHealth 4.7, which uses Ingres OI 2.0/9712 (hp8.us5/00) plus the 6640 patch. The new database contains their problem table nh_stats1_976316399, but does not contain the other problem stats0 table (nh_stats0_977464799), because they keep as-polled data for 4 weeks, so this table is not supposed to be there. The other thing I noticed from the customer's rollup plan setup is that they keep the stats0 tables for four weeks, as shown in the following sql output:
* select * from nh_rlp_plan \g Executing . . .
+------+------+------+-------------+-----------+
|rlp_ty|rlp_st|active|duration_size|sample_size|
+------+------+------+-------------+-----------+
|SD    |     0|     1|       259200|        300|
|SD    |     1|     1|       345600|      14400|
|SD    |     2|     1|       604800|      86400|
|SD    |     3|     1|      2419200|     604800|
|BD    |     0|     1|       259200|        300|
|BD    |     1|     1|       345600|      14400|
|BD    |     2|     1|       604800|      86400|
|BD    |     3|     1|     30240000|     604800|
|ST    |     0|     1|      2419200|        300|
|ST    |     1|     1|      3628800|       3600|
|ST    |     2|     1|     42336000|      86400|
|ST    |     3|     0|            0|          0|
|ST    |     4|     0|            0|          0|
|ST    |     5|     0|            0|          0|
|BL    |     0|     1|      3628800|       3600|
+------+------+------+-------------+-----------+
(15 rows)
I don't know why they want to keep stats0 for such a long period. This may possibly, for some reason, be making the rollup go back looking for some old tables. 
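For context on the retention discussion: duration_size and sample_size in nh_rlp_plan appear to be in seconds (an assumption inferred from the values, not stated in the ticket). A quick arithmetic check on the ST rlp_st 0 row (duration_size 2419200, sample_size 300) shows it is 5-minute samples kept for 28 days, i.e. the four weeks mentioned above:

```shell
# Convert the ST/0 row of nh_rlp_plan to human units, assuming both columns
# are seconds: duration_size 2419200, sample_size 300.
echo "stats0 kept $((2419200 / 86400)) days, sampled every $((300 / 60)) minutes"
```

This prints "stats0 kept 28 days, sampled every 5 minutes", which is why the rollup may keep reaching back for month-old stats0 tables.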
My suggestion is that they might want to reset the rollup plan to keep stats0 tables for only two days, through a console scheduled job. Then do the rollupDb and see if there is still a checksum error. If so, run the rollup in advanced mode with debug on. Also, Farrell, can you have the customer run a verifydb against nh_stats1_976316399 in report mode: the syntax is 'verifydb -mreport -sdbname "" -otable ', then send me the $II_SYSTEM/ingres/files/iivdb.log file. They did this last time without success. Thanks Yulun 1/30/2001 9:16:24 AM yzhang The file attached is not readable, but the good news is that they have not seen the checksum error for a while. But let them know the following: 1) If they agree, they had better change the stats rollup plan to retain the stats0 table for only 2 days or one week. 2) If they see the checksum error again, run verifyDb for the problem table, and run the rollup with advanced debug on so we can get more details. Thanks Yulun 2/2/2001 5:19:18 PM yzhang We have a customer who gets the checksum error and QEF error frequently, and they are very worried that these could make their database unstable, or crash. Would you help us with the following questions so that we have something to pass to our customer? 1. What is a checksum error? 2. Does ingres retry after a checksum error? 3. If yes, how many times? 4. What error is produced if all retries fail? 5. What are the benefits that may help this problem in the new release of Ingres shipped with 4.8? 6. What is a QEF error? 7. What causes them? 8. Is there anything that can be done to prevent these errors in the future? 2/5/2001 1:07:04 PM yzhang This is the answer from CA for Don's list of questions. It looks like the answer contains nothing new to us. The status of the Alcanet problem now is: 1) I am waiting for their errlog.log file containing the QEF error. 2) Customer should do an advanced stats rollup and run verifydb when they see the checksum error again. 
3) If they agree, they can reset the stats rollup plan to keep stats0 tables for only two days. I am wondering if any of you think there are any other actions we need to take to move this ticket quickly. This ticket has been here for a long time. Yulun 2/15/2001 3:43:20 PM rkeville Customer installed P02 and said the issue is now resolved; ticket 43957 is closed. 2/21/2001 9:54:50 AM jnormandin Customer has closed call ticket # 42641 2/21/2001 3:56:53 PM don close 11/28/2000 2:21:31 PM shagar During installation of 4.7.1 P01, customer received the following error on screen: ------------ Converting database nethealth su: illegal option -- f <<<<<<<<<<<<<<<< here is the error The database nethealth has been successfully converted. Restarting http server /opt/nethealth/web/httpd/bin/nhihttpd http server started. Successfully installed Patch NH471901 Please restart Network Health slsanh:/var/NH4_7_P01# ---------- Everything seems to be working properly, but the customer does not want any repercussions from the error message. ICS - GMBH Alcanet International 49-89-74 85 98 90 support@ics.de 12/13/2000 7:21:06 PM lemmon Reassigned to db team 12/14/2000 11:00:40 AM yzhang Sorry, I sent an incomplete email to you a few minutes ago. I checked the install code, which looks like this: InstallLiveT.sh:# echo "Converting database $NH_RDBMS_NAME" $SU "$NH_USER" -c "$NH_HOME/bin/nhConvertDb $db" -f InstallNH.sh: display "Converting database $db" $SU "$NH_USER" -c "$NH_HOME/bin/nhConvertDb $db" InstallPatch.sh: echo "Converting database $NH_RDBMS_NAME" $SU "$NH_USER" -c "$NH_HOME/bin/nhConvertDb $db" -f The script nhConvertDb only needs the database name as an argument, so it looks like the -f option in InstallPatch.sh and InstallLiveT.sh caused the illegal option error, and it should be removed. Thanks 12/18/2000 6:16:34 PM yzhang For 11753 (the illegal option -f), I think this is something related to HP 64-bit. I did not see the message on Solaris. 
I am installing 471 on an HP in the lab, then upgrading to Patch 1, to see if the message appears 12/18/2000 10:07:15 PM yzhang The illegal option problem is caused by the following -f option in InstallPatch.sh. InstallPatch.sh: echo "Converting database $NH_RDBMS_NAME" $SU "$NH_USER" -c "$NH_HOME/bin/nhConvertDb $db" -f I reproduced the problem, then reinstalled the patch with the modified script without the -f option, and that error is gone. Farrell, you can tell the customer that the patch they installed is successful despite the illegal option message. But if they want to reinstall patch 471P1, they can follow the instructions from the attached README file, replace INSTALL.NH (in the patch directory) with the attached InstallPatch.sh, rename InstallPatch.sh to INSTALL.NH, then run the install. Robin, if this needs to be patched, which patch should it go in? 1/11/2001 7:16:20 AM foconnor Customer has closed the call ticket; we can close this bug. 1/11/2001 7:17:58 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Thursday, January 11, 2001 7:08 AM To: Dong, Gavin Cc: O'Connor, Farrell Subject: Problem ticket 11753 Gavin, Customer has closed the associated Call ticket so we can close out the Problem ticket 11753. 11/29/2000 5:57:46 PM dkrauss Customer would like to be able to have multiple checkpoint locations to keep from overwriting the previous day's save. Currently only a save without checkpoints can be done like this. This is the workaround I gave the customer, but to conserve space, the customer would like checkpoint saves. Paul Dawson Empowered Networks E.W. Unsworth Brokers, Pearson International Airport, Toronto AMF, Ontario - CANADA phone: 613-271-7975 email: support@empowerednetworks.com 1/15/2001 9:21:03 AM wzingher Reassigned to UI team 9/1/2001 3:19:07 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request. 
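The -f removal from ticket 11753 above amounts to a one-line edit. A hedged sketch against a scratch copy of the offending line (not a live InstallPatch.sh; the /tmp path is an assumption):

```shell
# Reproduce the InstallPatch.sh fix from ticket 11753: the stray trailing -f
# after the su command line is what produces "su: illegal option -- f".
# Work on a scratch fragment rather than a real install script.
FRAG=/tmp/installpatch.frag
cat > "$FRAG" <<'EOF'
$SU "$NH_USER" -c "$NH_HOME/bin/nhConvertDb $db" -f
EOF
# Strip the stray trailing " -f"; nhConvertDb itself only takes the db name.
sed 's/ -f$//' "$FRAG"
```

The sed output is the corrected invocation, $SU "$NH_USER" -c "$NH_HOME/bin/nhConvertDb $db", matching the working line in InstallNH.sh.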
12/8/2000 1:24:32 PM foconnor 4.5.1 nethealth patch level 10 Unix ver HPuX10.20 Customer was unable to save the database; they had to drop a dialog table using the verifydb command, and then saves were successful. Per Bob Keville, this issue should be bugged because of the problem with the iiattribute table. From the II_SYSTEM/ingres/files/sysmod.log file: Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . E_US1200 Table name is not valid. (Fri Dec 1 13:59:02 2000) Sysmod of database 'nethealth' abnormally terminated. ################################################################################################################################## Excerpt from the $II_SYSTEM/ingres/files/errlog.log file: LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 12:22:38 2000 E_SC0216_QEF_ERROR Error returned by QEF. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 12:22:38 2000 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 12:22:38 2000 E_QS001E_ORPHANED_OBJ An orphaned Query Plan object was found and destroyed during QSF session exit. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 12:22:38 2000 E_QS0014_EXLOCK QSF Object is already locked exclusively. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM93A7_BAD_FILE_PAGE_ADDR Page 9 in table nh_dlg0_960346799, owner: concord, database: nethealth, has an incorrect page number: 29. Other page fields: page_stat 00000130, page_log_address (00000000,00000000), page_tran_id (0000000000000000). Corrupted page cannot be read into the server cache. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. 
LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM92CB_DM1P_ERROR_INFO An error occurred while using the Space Management Scheme on table: nh_dlg0_960346799, database: nethealth LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM93A7_BAD_FILE_PAGE_ADDR Page 0 in table nh_dlg0_960346799, owner: concord, database: nethealth, has an incorrect page number: 20. Other page fields: page_stat 00000130, page_log_address (00000000,00000000), page_tran_id (0000000000000000). Corrupted page cannot be read into the server cache. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM9264_DM1B_SEARCH Error occurred searching the Btree index. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_DM008E_ERROR_POSITIONING Error trying to position a table. LDCPS01 ::[2582 , 406e0b00]: Wed Dec 6 13:01:07 2000 E_QE0080_ERROR_POSITIONING Error trying to position a table. ################################################################################################################################# Conversation rollups failing: ---- Job started by Scheduler at '06/12/2000 06:00:50 PM'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (06/12/2000 06:00:50 PM). 
Error: Sql Error occured during operation (E_QE0080 Error trying to position a table. (Wed Dec 6 13:01:07 2000) ). ----- Scheduled Job ended at '06/12/2000 06:01:08 PM'. ----- ################################################################################################################################## Collected the $II_SYSTEM/ingres/files/*.log files, the output of infodb and logdump, and the output of the verifydb -odbms_catalogs -mreport -sdbname nethealth command. //voyagerii/tickets/43000/43148..... 12/13/2000 5:21:10 PM wzingher reassigned from lemmon to robin 12/18/2000 1:57:04 PM rtrei Yulun-- Just look at the errlog.log for signs of whether this was a normal or abnormal shutdown. I will show you how to read this, if you have any questions. If it was an abnormal shutdown, just close this. 1/3/2001 11:59:09 AM yzhang This is a sysmoding database problem with the following error: Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . E_US1200 Table name is not valid. You mentioned that if an abnormal shutdown is indicated in errlog.log, I should just close this ticket. But I noticed the shutdown is normal. I renamed the physical file for iiattribute in my testing, and I could not reproduce the error message. I checked with Farrell and he said everything is fine with the customer now, but he doesn't know if the customer still has the sysmoding problem. I want to know if we should continue investigating this problem. Thanks Yulun 1/3/2001 2:08:22 PM yzhang This ticket is closed; please have the customer do a save/destroy/create/reload even if the customer appears ok. 12/13/2000 8:39:09 PM jpoblete User discovers server elements using a seed file; the first discover is OK, but the second discovery will crash the console and all other nethealth proc's. Started proc's in debug mode: nhiServer -Dall start; the log shows the following: [s,cba ] Processing SIGCHLD signal ... [s,cba ] Detected program 504 died, status = 1 [s,cba ] Invoking SIGCHLD callback for pid = 504. 
[t,tb ] E:SpsExecuteList::programDiedCb (pid) [d,sps ] Program 504 died at 12/13/00 03:04:05 [t,tb ] E:SpsProgramDescription::SpsProgramDescription (pid) [t,tb ] X:SpsProgramDescription::SpsProgramDescription (pid) [D,sps ] Dead program: [D,sps ] file = D:/nethealth/bin/sys/nhiDbServer [D,sps ] pid = 504 [D,sps ] started = 12/13/00 02:27:38 [D,sps ] args = [D,sps ] restart = all [D,sps ] wait = 3 Attempted debugging with nhiDbServer -Dall, but that just kills the machine; used the standard debug flags from advanced logging, and for the last capture used flags: -Dmall -Df Otc -Dt Collected traces for the following proc's: nhiServer nhiMsgServer nhiCfgServer nhiDbServer nhDiscover nhiPoller All related files and debugging traces are in the call ticket directory, on voyagerii 12/15/2000 3:55:28 PM bhinkel Brett, I've been asked to give this to you to do an initial evaluation and see if what Discover is passing to DB Server is correct. 12/18/2000 1:47:34 PM bedelson I've looked over several of the advanced trace logs and it's not clear what's causing the failure. Discovery and the CfgServer don't appear to have errors. The last operation performed by the DB server is the EsdGetElements call, which I believe deals with the local cache. Forwarding to Robin for further inspection. 12/18/2000 2:43:14 PM rtrei Not clear to me that it is the DbServer that is crashing. 12/18/2000 3:07:32 PM jpoblete In all nhiServer debug traces, we could find the following: [s,cba ] Processing SIGCHLD signal ... [s,cba ] Detected program 504 died, status = 1 [s,cba ] Invoking SIGCHLD callback for pid = 504. 
[t,tb ] E:SpsExecuteList::programDiedCb (pid) [d,sps ] Program 504 died at 12/13/00 03:04:05 [t,tb ] E:SpsProgramDescription::SpsProgramDescription (pid) [t,tb ] X:SpsProgramDescription::SpsProgramDescription (pid) [D,sps ] Dead program: [D,sps ] file = D:/nethealth/bin/sys/nhiDbServer [D,sps ] pid = 504 [D,sps ] started = 12/13/00 02:27:38 [D,sps ] args = [D,sps ] restart = all [D,sps ] wait = 3 12/19/2000 10:00:57 AM rnaik Need log files from cust. support, obtained by reproducing the problem while running dbServer with the -Dmall -Df dDizc flags. 12/19/2000 3:36:26 PM jpoblete I have sent you the requested information. 12/20/2000 3:02:43 PM rnaik Jose, can you get log files obtained the following way: In startup.cfg, set arguments to -Dmall -Df dciz for nhiDbServer, nhiMsgServer, nhiCfgServer. Have them stop and re-start their nhServer, and also have them set advanced logging for discover and console. Please get all the log files along with the system.log in the $NH_HOME/log directory too. Thanks. 12/21/2000 2:06:04 PM jpoblete I have forwarded you the new files. 12/21/2000 4:23:57 PM rnaik Jose, thanks for the files. I found that the socket dbServer is failing to write to is that of the Net Poller. Unfortunately we do not have the poller log files now. We need to ask the customer to reproduce the problem again for us, and this time do everything that we did last time and also pass debug arguments for all 4 pollers (-Dm poller:cu:ccm:nwb:dsvr:csvr:msvr -Dfall) in the startup.cfg file along with nhidbserver, nhimsgserver and nhiconfigserver (-Dmall -Df dciz), and ALSO turn on advanced logging for nhiConsole and nhiDiscover (-Dmall -Df dciz) in debuglog.cfg. Can you also ask them to give us the system log, i.e. $NH_HOME/log/system.log. I spoke to Dave Shepard and according to him, as long as the customers can reproduce the problem in half an hour, passing -Dfall for the poller should be okay. 12/21/2000 4:31:37 PM rnaik Sorry, can you just use the same flags for the poller as for the other servers, i.e. 
-Dmall -Df dDciz -Dt Thanks. 12/29/2000 9:22:35 AM rnaik I found out from cust. support that after increasing the memory from 500MB to 640MB, the customer is no longer able to reproduce the problem. Jose is going to follow up with them again next week and has asked them to keep an eye out in case it occurs again. 1/2/2001 2:13:15 PM jpoblete I'm trying to get a new status on this; they have not reported any problem after the memory upgrade. When the customer confirms that the problem is gone, we will close this one. 1/3/2001 4:03:06 PM jpoblete Still can't get a customer response; will let you know when we can close this one. 1/4/2001 11:59:58 AM jpoblete Customer called me; he will test the whole day today and will let me know tomorrow. 1/8/2001 5:22:51 PM jpoblete Have been trying to talk with the customer about this, but he has been unreachable; will try again tomorrow. 1/9/2001 11:40:55 AM jpoblete Finally, the customer agreed to close this one; they have not experienced any problem after the memory upgrade. 1/15/2001 9:41:51 AM rnaik In the next release, need to make sure that the dbServer does NOT just "exit" if there is a failure to write to the socket. 2/23/2001 6:13:53 PM rnaik By looking at the code, this seems to be fixed. In 5.0, an "exitOnError" flag is passed to CcmSocket::sendMsg, and if this flag is set to NO, CcmSocket::sendMsg doesn't exit when it receives a Bad status from CcmUtil::ccmPutPacket. Also in 5.0, ccmPutPacket retries 100 times before sending a Bad status to sendMsg. So, need to look a little deeper to verify and close. 3/1/2001 10:22:06 AM wzingher marking evaluated. 4/20/2001 12:41:32 PM pkuehne Changing 'Assigned Priority' to "High" 5/15/2001 6:00:50 PM tctang I've looked at this with Rupa and it seems code has been added since she looked at it. It now retries with a sleep interval, so if this still fails, there is a bigger problem. 12/14/2000 4:08:53 PM jnormandin Problem: Customer experienced a rollup failure, Append to nhStats1 table. 
He did not send in the rollup log, as we did the commands over the phone to retrieve the min and max. - Difference between MIN and MAX sample times was 2 stats0 tables. - I consulted with Tony P, and we decided to just drop the 2 stats0 tables that were not deleted during rollup. - The tables were deleted, and when we tried to run rollups again, we received the error: Append to table nh_stats1-967867199 failed. All rows were duplicates. - Verified disk space; there was plenty. - Contacted Yulun Zhang, apprised him of the situation, he requested: - current table listing - ingres error log. 12/15/2000 10:05:29 AM yzhang Jason, run the attached two scripts: first run fix12071.sh, then run nhCleanDupStats.sh. Just type the script name at the command prompt after logging in as the nh user and sourcing nethealthrc.csh. 1/4/2001 5:36:29 PM jpoblete Application Engineer never responded to us, please close this ticket 1/5/2001 4:26:09 PM yzhang ticket closed 12/15/2000 4:36:45 PM jpoblete Customer starts Network Health, and it starts polling; after the 2nd or 3rd poll, the Statistics polling status stays on green and the iimerge process uses 80% of cpu. 
However, looking at the top output the process nhiPoller seems to be sleeping: PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND 1105 ingres 35 0 3 36M 23M run 42.8H 79.87% iimerge 4727 neth 1 0 0 1880K 1336K cpu3 0:05 0.96% top 3202 neth 4 50 0 51M 44M sleep 1:17 0.00% nhiPoller 26502 neth 1 32 0 53M 38M sleep 0:57 0.00% nhiStdReport 3749 neth 1 11 0 53M 38M sleep 0:56 0.00% nhiStdReport 3201 neth 1 51 0 32M 27M sleep 0:25 0.00% nhiCfgServer 3199 neth 1 58 0 17M 13M sleep 0:07 0.00% nhiDbServer 1 root 1 58 0 704K 328K sleep 0:06 0.00% init 24993 neth 1 58 0 37M 22M sleep 0:05 0.00% nhiConsole 5143 neth 1 53 0 45M 30M sleep 0:04 0.00% nhiStdReport 27891 neth 1 58 0 45M 30M sleep 0:04 0.00% nhiStdReport 3204 neth 4 58 0 28M 21M sleep 0:03 0.00% nhiPoller 295 root 1 58 0 1760K 1072K sleep 0:02 0.00% sshd1 242 root 6 58 0 2368K 1904K sleep 0:01 0.00% vold 5142 neth 1 0 0 6736K 3784K sleep 0:00 0.00% nhiReport After 1 hour customer will get the message: Wednesday, December 13, 2000 06:58:13 PM Error (nhiMsgServer) 'Statistics' poller is not running (last poller activity at '12/13/2000 17:58:13'). Wednesday, December 13, 2000 07:58:13 PM Error (nhiMsgServer) 'Statistics' poll did not complete (poll started at '12/13/2000 17:58:13'). Got the following: - pollerHistory.out - poller.cfg - latest system log - files under pollerStatus directory. 12/15/2000 4:47:09 PM jpoblete Some additional info: Polled Elements: 3000 NH_SNMP_TIMEOUT = 4000000 NH_SNMP_RETRIES = 3 NH_SNMP_ALT_RETRIES = 2 NH_SNMP_DEVICE_TIMEOUTS = 15 NH_STAT_POLLS_PER_SECOND = 100 NH_POLLS_PER_SEC = 100 NH_POLL_BS = linear NH_POLL_ALT_BS = constant NH_POLL_ALL_AGENTS_THROTTLED = yes NH_POLL_AGENT_THROTTLE = 5 NH_POLL_PING_TIMEOUT = 1 NH_POLL_PING_MAX = 64 NH_POLL_PING_RETRIES = 4 NH_POLL_PING_DISABLED = No NH_CISCO_PING_TIMEOUT = 2000 NH_CISCO_PING_DISABLED = No 12/15/00 15:10:55 [d,poller] set wakeup timer to 76 seconds 12/15/00 15:12:11 [d,poller] doWakeup () invoked at 12/15/00 15:12:11 ... 
12/15/00 15:12:12 [d,poller] doPoll () invoked at 12/15/00 15:12:12 ... 12/15/00 15:12:12 [d,poller] Pinging started at 12/15/00 15:12:12 ... 12/15/00 15:12:12 [d,poller] doWakeup () completed. 12/15/00 15:12:29 [d,poller] Pinging completed at 12/15/00 15:12:29. 12/15/00 15:12:29 [d,poller] ProcessPollList nOutstandingNdrs: 0 12/15/00 15:12:30 [d,poller] doPoll () completed at 12/15/00 15:12:30. 12/15/00 15:13:55 [d,poller] Device 128.254.237.10 timed out 12/15/00 15:13:57 [d,poller] Device 00:30:B6:17:6C:110 timed out Poller: (poll 2) poll took 122 seconds 12/15/00 15:14:13 [d,poller] Requests finished. 12/15/00 15:14:24 [d,poller] Closing database files. 12/15/2000 4:48:05 PM dshepard This just sounds like the usual availability backfill. If the servers had been down for a while, the pollers will attempt to backfill the availability on the second poll after it comes up. If it has been down for a year, then it has to backfill a year of raw data. This can take a long time. Find out when the last time was that they polled data prior to this restart. 12/15/2000 5:00:52 PM jpoblete This just started on Monday. It does not always hang at the 3rd poll; the poller status stays on poll in progress, then the customer becomes aware of this because of the messages in the console and has to restart the servers. The DB only contains data for the last 2 weeks. If this happens overnight, the user finds that the poller status is unknown. 12/15/2000 5:12:49 PM dshepard -----Original Message----- From: Poblete, Jose Sent: Friday, December 15, 2000 4:52 PM To: Shepard, Dave Subject: RE: Remedy 12093 This just started this Monday; it does not always hang at the 3rd poll, then the customer becomes aware of this and has to restart the servers; the DB only contains data for the last 2 weeks. -------------------------------------------------------------------------- If it is hanging this soon, that means the server was down. How long has it been since he was able to complete 3 polls? 
Does it ever get past it when he restarts the servers, or does this happen every time? If it is really hung, then we need a pollerHistory.out checkpoint file from sending it a kill -USR1. If he never gets past the second poll, then tell him to stop killing and restarting it. Just let it run for a while. One hour is probably not enough, and he is just making the problem worse if it really is backfilling the availability. Given that the Ingres database is taking up the CPU time, it certainly looks like that is the case. 12/15/2000 5:14:04 PM dshepard As usual, whenever we get a checkpoint dump, we'll need the poller.cfg file as well. 12/15/2000 5:21:49 PM jpoblete All the info is in the call ticket directory. 12/15/2000 6:31:43 PM dshepard This looks like a database problem. The poller is hung waiting for CdbTblsStats::loadDatabase() to return. The customer reports that iimerge is very busy. There are some nasty looking errors in the various Ingres errlog.log files that they provided. I am therefore assigning to the database group. 12/15/2000 7:17:31 PM jpoblete Got the ingres errlog.log, saw these messages: PDCD07-C::[32915 , 00000026]: Thu Dec 14 13:43:30 2000 E_CL100A_LK_INTERRUPT Locking system request was interrupted. PDCD07-C::[32915 , 00000026]: Thu Dec 14 13:43:30 2000 E_RD000C_USER_INTR User interrupts while requesting DMF function. PDCD07-C::[32915 , 00000026]: Thu Dec 14 13:43:30 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 00000026]: Thu Dec 14 13:43:30 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 00000040]: Thu Dec 14 15:50:40 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 00000040]: Thu Dec 14 15:50:40 2000 E_QE0022_QUERY_ABORTED The query has been aborted. 
PDCD07-C::[32915 , 00000040]: Thu Dec 14 15:50:41 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 00000053]: Thu Dec 14 16:54:39 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 00000053]: Thu Dec 14 16:54:39 2000 E_QE0022_QUERY_ABORTED The query has been aborted. PDCD07-C::[32915 , 00000053]: Thu Dec 14 16:54:39 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 0000005a]: Thu Dec 14 17:39:01 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 0000005a]: Thu Dec 14 17:39:01 2000 E_QE0022_QUERY_ABORTED The query has been aborted. PDCD07-C::[32915 , 0000005a]: Thu Dec 14 17:39:01 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 00000060]: Fri Dec 15 08:42:27 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 00000060]: Fri Dec 15 08:42:27 2000 E_QE0022_QUERY_ABORTED The query has been aborted. PDCD07-C::[32915 , 00000060]: Fri Dec 15 08:42:27 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 00000078]: Fri Dec 15 10:04:03 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 00000078]: Fri Dec 15 10:04:03 2000 E_QE0022_QUERY_ABORTED The query has been aborted. PDCD07-C::[32915 , 00000078]: Fri Dec 15 10:04:03 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 0000008e]: Fri Dec 15 13:21:56 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 0000008e]: Fri Dec 15 13:21:56 2000 E_QE0022_QUERY_ABORTED The query has been aborted. PDCD07-C::[32915 , 0000008e]: Fri Dec 15 13:21:56 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 0000009f]: Fri Dec 15 14:55:58 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 0000009f]: Fri Dec 15 14:55:58 2000 E_QE0022_QUERY_ABORTED The query has been aborted. 
PDCD07-C::[32915 , 0000009f]: Fri Dec 15 14:55:58 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier PDCD07-C::[32915 , 000000a7]: Fri Dec 15 16:49:46 2000 E_DM0065_USER_INTR User interrupt forced DMF operation abort. PDCD07-C::[32915 , 000000a7]: Fri Dec 15 16:49:46 2000 E_QE0022_QUERY_ABORTED The query has been aborted. PDCD07-C::[32915 , 000000a7]: Fri Dec 15 16:49:46 2000 E_GC0005_INV_ASSOC_ID Invalid association identifier 12/15/2000 8:52:45 PM jpoblete Found a couple of tickets with the same messages that were related to an insufficient ingres transaction log. Their ingres_log was about 600kb, so we decided to resize it; then nhStartDb -boot failed on sysmod, so we decided to save / destroy / create / reload. Then we started the Nethealth proc's and turned on the poller. 12/18/2000 9:17:56 AM rtrei Jose-- I was suspecting a sysmod duplicate error as soon as you started talking about the database hanging. Since you have done a destroy, create, reload, there is nothing more I can offer except that we will want to watch this for a few days. Remind the customer that they must shut down the system properly: they must use shutdown or manually stop the database before issuing the reboot command. I will look over the errlogs, but I don't think there is more we can do than you have already done. Setting this to more info while we monitor the customer. 12/21/2000 11:19:16 AM jpoblete Everything is OK so far, please close this. 12/21/2000 2:26:01 PM rtrei closing. 12/15/2000 5:42:23 PM rkeville Dialog rollups failing, Error: Append to table nh_dlg1b_966311999 failed - 4.5.1 P14. - Supplied customer with the new nhiDlgRollup executable; they have replaced theirs with the new one and it has not resolved the problem. ################################################### 12/19/2000 2:06:13 PM rtrei Bob-- Let's get their database and take a look at it. 
Also, please do the following: echo "select table_name, create_date, num_rows from iitables where table_name like 'nh_dlg%'\g" | sql nethealth > tabletimes.out We've had a few cases where tables were created long after they should have been, so this will help with that. All the logs in $II_SYSTEM/ingres/files/*.log Especially make sure you get the errlog.log The Nethealth system messages. Save the messages using the NH console. All the files in $NH_HOME/logs (just tar up the entire directory; that is easiest) 12/22/2000 11:32:04 AM rtrei Yulun, I am reassigning this to you in case the data comes in while I am on holiday. 12/28/2000 7:09:00 PM rkeville File and db are on the escalated tickets dir. ################################################### 1/3/2001 9:46:25 AM yzhang Can you have the customer run the two attached scripts, just by typing the script name at the command prompt after logging in as the nh user and sourcing nethealthrc.csh. Run fix12095_1.sh first, then fix12095_2.sh. Run the dialog rollup after executing the two scripts. Thanks Yulun 1/4/2001 5:13:02 PM rkeville I have requested the DB again from NPS; they will attempt to contact the customer tomorrow. ################################################### 1/11/2001 11:15:50 AM yzhang Sent two scripts to the customer to clean the database, and waiting for their result. 1/11/2001 1:09:44 PM rkeville Sent the scripts to NPS for the customer to run this week. 1/12/2001 10:52:57 AM rkeville The scripts worked, close ticket. 1/12/2001 2:26:53 PM schapman Bob sent notice that the scripts resolved the problem. 12/19/2000 3:33:33 PM tcordes Customer would like to see this command conform with UNIX standards and output to standard out instead of standard error. Customer: Peter Toye at EDS Network Services (through ML Enterprise) 9/1/2001 3:19:07 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request. 12/21/2000 4:55:39 PM tcordes CPU utilization for Cisco 6509 indicates a constant ~60% util. rate. 
After rollups, At A Glance indicates a constant ~6% util rate. Even taking the rollup calculations into consideration, this is unlikely. Customer is currently polling a Cisco 6509, and is using a custom finder.tcl for certification work. This is important when considering asking the customer to patch, but has no bearing on data integrity. Customer is very sensitive: DBS (via Datacraft Asia)
12/21/2000 4:56:01 PM tcordes Reports, customer finder on Voyagerii
12/22/2000 11:27:57 AM rtrei Discussed this problem with Jay. It turns out that this is a cert issue and Nick Saparoff is the one who should handle this. Reassigning to him. I am also notifying Support that they should talk to him for an immediate status update.
12/22/2000 12:14:03 PM nsaparoff same as 11536:
1/2/2001 4:05:36 AM schapman Nick, I received the following reply in regards to this problem ticket. The reports are on voyagerii\tickets\43000\43131\Jan2. Please copy Farrell O'Connor as to what information you will require from the customer to aid in resolution of this issue.
1/2/2001 4:12:52 AM schapman .
1/3/2001 11:28:24 AM nsaparoff FYI - update on these tickets - the updated columnExpression.sys file (for 4.7.x) has been placed on VoyagerII in the escalated tickets directory. nhConvertDb will have to be run on the database. To force a rollup to occur: nhiRollupDb -now 1/5/00 21:00 As it stands, 4.7.x D03 is still good. Customers who we know are affected: Qwest Communications, DBS Bank.
Some Background: Element Type 252 was incorrectly added to the label table when SWii was released. Any incorrect mtfs using this element type that were not normalizing data (multiplying by deltaTime) would show correct RAW data until a rollup, when the columnExpression would be applied.
The Fix: 1. fix all SWII sys cpu mtfs which aren't normalizing data 2. update the label tables with the correct columnExpression.
What used to be this: 199|(100.0*(TR_CONTENTION_STREAMING/TR_BIT_STREAMING)) Should now become this: 199|(100.0*DELTA_TIME*(TR_CONTENTION_STREAMING/TR_BIT_STREAMING))
1/8/2001 9:22:35 AM bhinkel Waiting for customer feedback.
1/8/2001 9:24:23 AM schapman -----Original Message----- From: Mary.John@Datacraft-Asia.com [mailto:Mary.John@Datacraft-Asia.com] Sent: Monday, January 08, 2001 12:03 AM To: Chan, Kok-Heng (Kenneth) Cc: Chapman, Sheldon; Satish.Satam@Datacraft-Asia.com Subject: RE: Call Ticket 43131 Problem Ticket 12172
Hi Sheldon The CPU elements have been deleted and the memory utilization is being monitored. We will know if the patch works after the roll-up scheduled for tomorrow. regards mary
1/10/2001 9:53:14 AM nsaparoff -
1/11/2001 5:02:01 PM bhinkel Also targeted for 4.8 P1.
12/28/2000 2:47:52 PM jpoblete Customer: Bear Stearns Issue: Conversation Rollup failing
-rw-r--r-- 1 health comms 431 Dec 27 04:16 /opt/health/log/Conversations_Rollup.100001.log
/opt/health/log/Conversations_Rollup.100001.log: Error: Append to table nh_dlg1b_977374799 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 2 rows not copied because duplicate key detected.)
They do not have space problems yet. Checking disk spaces ...
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d3 3283334 2811438 439063 87% /opt
For the table: nh_dlg1b_977374799 The min sample time is: 977302799 Wed Dec 20 03:59:59 2000 The max sample time is: 977374799 Wed Dec 20 23:59:59 2000
From the help output, there's just one table in the range: nh_dlg0_977302799
977302799 nh_dlg0_977302799 health table
977374799 nh_dlg0_977302799_ix1 health index
nh_dlg0_977403599 health table
nh_dlg0_977403599_ix1 health index
nh_dlg0_977417999 health table
nh_dlg0_977417999_ix1 health index
nh_dlg0_977432399 health table
nh_dlg0_977432399_ix1 health index
nh_dlg0_977446799 health table
nh_dlg0_977446799_ix1 health index
The min, max sample times for table nh_dlg0_977302799 are: min: 977297400 Wed Dec 20 02:30:00 2000 max: 977574600 Sat Dec 23 07:30:00 2000
The dlg0 tables should have only 4 hours worth of data; at this point we don't know if dropping this table is the right thing to do. Spoke to Yulun Zhang at the DB team; he asked to log a problem ticket since this should not happen in Nethealth 4.7.1 P01.
12/28/2000 5:30:30 PM yzhang Have the customer drop table nh_dlg0_977302799, then run the following query: delete from nh_rlp_boundary where max_range = 977302799 and rlp_stage_nmbr = 0 and rlp_type = 'BD'
1/2/2001 11:47:11 AM jpoblete The problem is gone, please close this.
1/2/2001 1:03:05 PM yzhang Problem is gone, ticket closed
1/4/2001 7:01:46 PM rkeville Customer needs to have changes to the rollup schedule logged in a log file for change history. - Needs it to include attributes such as the user that changed it, the old settings and the new settings, and anything else we can think of to put in there. - Customer is Chris Knowled from MCI.
#################################################
7/31/2001 4:13:04 PM bhinkel Re-assigned to Joel since this is an improvement.
9/1/2001 3:19:08 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.
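Yulun's two-step fix above (drop the stale dlg0 table, then clear its rollup-boundary row) can be scripted in one place. A minimal sketch using the table name and values from this ticket; the guard keeps it inert on a machine without the Ingres `sql` client:

```shell
# fix_dlg0_boundary.sh - sketch of the fix described above: drop the
# stale nh_dlg0 table and delete its row from nh_rlp_boundary.
TABLE=nh_dlg0_977302799
UTC=977302799

SQL="drop table ${TABLE}\\g
delete from nh_rlp_boundary where max_range = ${UTC} and rlp_stage_nmbr = 0 and rlp_type = 'BD'\\g"

if command -v sql >/dev/null 2>&1; then
    # on a Network Health host: apply the fix
    echo "$SQL" | sql nethealth
else
    # off-box: just show the statements
    printf 'would run against nethealth:\n%s\n' "$SQL"
fi
```

Both statements go to the same `nethealth` session, mirroring the `echo "…\g" | sql` style used throughout these tickets.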
1/5/2001 10:49:30 AM wburke iimerge running for over 24 hours @ 24% cpu. Background: Customer has other bug tickets. 1. NameNodes runs 24+ hours. Runs script every 48 hours. iimerge starts after NameNodes is finished; usually finishes in 15 minutes, not this time. When this process runs, all other functionality, i.e. polling, is unavailable. Customer is down:
1/5/2001 10:50:21 AM wburke Attached is part of my syslog from today. I still am not polling correctly and iimerge is running.
Job 'Conversations Rollup' finished (Job id: 100001, Process id: 10192).
Friday, January 05, 2001 08:26:41 AM A scheduled poll was missed, the next poll will occur now (Conversations Poller).
Friday, January 05, 2001 09:26:42 AM Error (nhiMsgServer) 'Traffic Accountant' poller is not running (last poller activity at '01/05/2001 08:26:42').
Friday, January 05, 2001 09:59:58 AM System Event (nhiConsole) Console initialization complete.
uxhealth2% tail syslog15 -l 20
Friday, January 05, 2001 08:05:10 AM Starting job 'Conversations Rollup' . . . (Job id: 100001, Process id: 10192).
Friday, January 05, 2001 08:06:01 AM Job 'Conversations Rollup' finished (Job id: 100001, Process id: 10192).
Friday, January 05, 2001 08:26:41 AM A scheduled poll was missed, the next poll will occur now (Conversations Poller).
Friday, January 05, 2001 09:26:42 AM Error (nhiMsgServer) 'Traffic Accountant' poller is not running (last poller activity at '01/05/2001 08:26:42').
Friday, January 05, 2001 09:59:58 AM System Event (nhiConsole) Console initialization complete.
uxhealth2%
The console is showing poller status unknown. Still no additional data in the Ingres error log. Don
1/5/2001 12:40:48 PM rtrei Yulun-- We will want to discuss this with CA since this is a 4.7.1 system
1/8/2001 11:47:15 AM yzhang Walter, You can forward this information to the customer; I found this on the Ingres Q/A site. See if their problem is due to a memory leak, or if increasing swap space can solve their problem.
Swap Space Drained
> We just moved our primary database from 6.4 on SCO to Ingres II on Solaris 7
> x86. The conversion went very well, but it hit the fan later. The iidbms
> process appears to be draining the swap space. When we run out of
> swap...crash.
>
> Top reveals that iimerge, with the pid of the iidbms, just keeps growing,
> even at night when the database is mostly idle (it serves a web site). This
> is also confirmed with
> ps -elf | sort -n +9 -10 | tail -22
>
> How can I stop the iidbms from consuming the swap space?
>
> Thanks,
> Todd Boewe
> Database Engineering
> www.lioninc.com
>
> From: Martin Bowes
They suggested doing the following: they have to log in as nh_user, source nethealthrc.csh, then log in as ingres and do the following. 1) optimizedb 2) sysmod 3) Do whatever they are supposed to do. If they still have the same problem after doing these, I will instruct them to use ipm to see what the current sessions are doing.
2/8/2001 4:36:34 PM yzhang Problem solved and ticket closed
1/9/2001 4:41:47 PM rkeville A customer has added some custom Applications to Traffic Accountant by adding them to the .usr files in the $NH_HOME/sys dir. They want to remove these applications and the data for them from the database. They don't want to lose any other TA data. Files are on voyagerii.
1/29/2001 10:03:48 AM wzingher Assigning to TA Group
1/29/2001 10:25:57 AM brad The product is working as designed. If this is to be addressed, it needs to be addressed by the DB team.
2/20/2001 8:49:41 PM rkeville I have provided the script to the customer to do what they need done; you can close this call ticket.
2/21/2001 8:35:20 AM wzingher Customer received scripts from Bob Keville and the problem is solved.
1/15/2001 10:44:47 AM jnormandin Conversations rollup failure with the following message: Error: Append to table nh_dlg1b_959065199 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 445 rows not copied because duplicate key detected.)
- Collected the following information from the customer: (From DB troubleshooting guide ver 3, page 7)
- echo "select table_name, create_date from iitables\g" | sql nethealth > createDate.out
- echo "select table_name, num_rows from iitables\g" | sql nethealth > numRows.out
- All *.log files in the $II_SYSTEM/ingres/files directory.
- The DLT of the problem table 959065199 is 23-may-00 02:59
The Database troubleshooting guide compiled by Robin Trei states the resolution for this problem as follows:
If the customer is 4.6, get them to patch 4: The customer is running 4.7.1, so this avenue will not help.
Determine the UTC of the dlg1 table from the error message. The UTC of dlg1b_959065199 is May 23.
Make a list of all the dlg0 tables that will need to be dropped. Worst case, it is all the dlg0 tables with a UTC less than that in the problem dlg1 table, but we can usually do better than that. See below.
All of the dlg0 tables have a UTC date greater than 23-may-00. In fact, the earliest UTC date on the dlg0 tables is from September 2000.
Drop the tables and remove them from the nh_rlp_boundary table. Since the UTCs are all greater, which table should be dropped?
The footprint for this second problem has been that we have some very small nh_dlg0 tables that were created after rollups were originally done. So, the easiest way to determine what tables to drop is to look at the sizes of the tables, and drop the smallest one. (It seems to have a pattern of some number of small tables, a medium sized table, and the remaining being the normal size for the site. Drop all the small tables.)
There are a few dlg0 tables with a size of 0; there are also some of various other small sizes. Should the tables with the 0 size be dropped? What about the other small tables?
A more absolute way is to compare the utc date and the create date from the table. If the utc time is several days before the create_date, then drop the table.
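The "utc vs create_date" rule above can be automated, since the UTC epoch is embedded in the table name itself. A minimal POSIX-shell sketch; the 3-day threshold is an assumption standing in for the guide's "several days", and the create date is passed in as an epoch:

```shell
# dlg_utc_check.sh - sketch of the "utc vs create_date" rule above.
dlg_table_utc() {
    # the UTC epoch is everything after the last '_' in the table name
    echo "${1##*_}"
}

flag_if_stale() {   # args: table_name  create_date_as_epoch
    utc=$(dlg_table_utc "$1")
    # "several days" from the guide, taken here as 3 days (assumption)
    if [ $(( $2 - utc )) -gt $(( 3 * 86400 )) ]; then
        echo "$1: created $(( ($2 - utc) / 86400 )) days after its UTC - drop candidate"
    fi
}

# the table from this ticket: UTC ~23-May-2000, create_date 08-Aug-2000
flag_if_stale nh_dlg1b_959065199 965692800
```

Fed the create_date column from the tabletimes.out query, this flags exactly the small, late-created tables the footprint describes; a fresh table whose create date sits inside its UTC window produces no output.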
The UTC of dlg1b_959065199 is May 23, while the create date listed is 2000-08-08.
As a last ditch measure, you can drop all the prior dlg0 tables, but in this case the data loss will be severe and needs to be discussed with the customer. As stated above, the UTC rule does not seem to apply in this case.
1/15/2001 5:18:23 PM yzhang Jason, Let's get their database and take a look at it. Also, please do the following: echo "select table_name, create_date, num_rows from iitables where table_name like 'nh_dlg%'\g" | sql nethealth > tabletimes.out All the logs in $II_SYSTEM/ingres/files/*.log Especially make sure you get the errlog.log The Nethealth system messages. Save the messages using the NH console. All the files in $NH_HOME/logs (just tar up the entire directory, is easiest)
1/24/2001 10:38:30 AM jnormandin - TA data was removed from the DB so this is no longer an issue.
1/18/2001 5:10:09 AM tstachowicz Error from fetch log:
Cleaning up merge files
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
Done merging database nethealth.
------------------------------------------
This does not seem to have an effect on the merge. The data is imported correctly. After speaking with Robin, she has suggested collecting everything in $NH_HOME/log and passing this to Yulun.
FACTS: The central machine is: hostname: Appau30020, NT4.0 sp5, NH version 4.7.1 The remote machine is: hostname: au1055s HP Unix Version 11
BACKUP INFO:
voyagerii/escalated tickets/43000/43570/fetch.log
voyagerii/escalated tickets/43000/43570/Appau30020/Nt_log directory
voyagerii/escalated tickets/43000/43570/au1055s/Unix_log directory
voyagerii/escalated tickets/43000/43570/system.log
voyagerii/escalated tickets/43000/43570/netrc_machines
2/13/2001 11:23:57 AM rrick Yulun, This is the email I just sent to our reseller in Australia. Any news on what you think the following messages are:
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
[: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error
Russell K. Rick
3/22/2001 10:48:36 AM rrick -----Original Message----- From: Shane Burdan To: support@concord.com Sent: 03/13/2001 6:39 PM Subject: Ticket: 43570
Hello, The customer is asking for a time for this bug to be fixed as it is seriously affecting their distributed polling. This includes 1) The high number of duplicate errors and 2) The alias names for certain routers not fetched across. What is the status of this ticket? Kind Regards, Shane Burdan Customer Support Engineer Phone: +612 99650608 Fax: +612 99290411 email: support@ipperformance.com.au
_________________________________________________________________________
Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com.
-----Original Message----- From: Rick, Russell Sent: Wednesday, March 14, 2001 12:12 PM To: 'Shane Burdan' Cc: Gray, Don; Keville, Bob Subject: RE: Ticket: 43570 Yulun, Any news on this ticket yet? Thanks again, - Russ Rick
3/23/2001 11:33:49 AM rrick NILM with Yulun to call me back.
4/23/2001 2:22:18 PM rrick -----Original Message----- From: Rick, Russell Sent: Monday, April 23, 2001 2:12 PM To: 'sburden@ipperformance.com.au' Cc: 'David Scott' Subject: RE: Call Ticket #43570
Hi Gentlemen:
Customer: - ANZ Bank/IPM Solutions, Shane Burden.
Problem: - Customer receives syntax errors from the mergeDbs function of the nhFetchDb.sh script. - Customer receives duplicate key detected messages in the fetch log. - Customer stated some of their aliases are not being brought over from the remote machines.
Status: - A bug has been submitted to address the syntax and duplicate key messages in the fetch log. - Engineering has requested a fetch be run using a debug flag to capture error output that is not normally output to the fetch log. This will determine where the syntax is incorrect and if the duplicate is located in a stats table or the nh_element table. - To address the question of the aliases not being brought over, I will get a dump of the nh_elem_alias table from each machine and compare them.
Instructions:
1. Please have your customer re-execute nhFetchDb using the following command line syntax and send the output to support@concord.com, ATTN: Russ Rick: sh -x nhFetchDb [please add the appropriate nhFetchDb parms that you used at the customer site, that produced the errors, originally] >& 43570Fetch.out
2. Please dump the nh_elem_alias tables on both servers. Execute the following script and then forward this output, as well:
Please perform the following: Please download or FTP this attached file of the "dumpAliasTable" script to the $NH_HOME directory on both servers.
NOTE: If you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format.
NOTE: This file is NOT in zipped-up or compressed format.
Log in to the Network Health System as the nethealth user. Command line syntax: sh dumpAliasTable.sh The output will be named "43570Alias.out" in the same directory
Thanks again for all your patience, Russell K. Rick
4/26/2001 10:36:05 AM rrick -----Original Message----- From: David Scott [mailto:dscott@ipperformance.com.au] Sent: Tuesday, April 24, 2001 1:02 AM To: Rick, Russell Cc: Peter T Bartlett; sburden@ipperformance.com.au Subject: RE: Call Ticket #43570
Hello Russ, Just regarding the configuration at ANZ - The MELBOURNE CENTRAL MACHINE is an NT box. The UK REMOTE POLLER is an NT box. The MELBOURNE REMOTE POLLER is a HP-UX machine.
I'm doubtful that I have collected what you're after with the command - sh -x nhFetchDb [please add the appropriate nhFetchDb parms that you used at the customer site, that produced the errors, originally] >& 43570Fetch.out
Firstly, the command you supplied above for the Fetch failed to work. If I ran it as you suggested from an MS-DOS prompt on the CENTRAL NT machine I received - The handle could not be duplicated during redirection of handle 1
If I invoked a shell first and then tried the command I received - bad file description
When I corrected the output redirection specification and tried again, I generated an output file which contained only 1 line indicating that nhFetchDb could not be found. So I substituted the command with the following - sh -x -c nhFetchDb > 43570Fetch.out 2>&1
This worked, but I fear that I did not get all the output you were hoping for. Also the Alias Table dump only seemed to produce info when run on the HP-UX machine. All the associated files are included in the attached zip file. When you answer this email could you please specify the exact command syntax and steps that you need me to perform. Regards, David Scott.
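David's redirection trouble above likely comes from `>&`, which is csh syntax; under a Bourne-style `sh` (including the NT port) the portable spelling is `> file 2>&1`. A minimal sketch of the requested debug run; the path, host and `-nr` value are the site-specific examples quoted in this ticket, and the guard makes it print-only where the fetch script is absent:

```shell
# debug_fetch.sh - sketch of the debug capture requested above: run the
# fetch under `sh -x` so the shell trace and stderr both land in the
# output file together with stdout.
OUT=43570Fetch.out
CMD="nhFetchDb.sh -p D:/nh5.0/db/remotePoller -rh sulfur -nr 3"

if command -v nhFetchDb.sh >/dev/null 2>&1; then
    # `> file 2>&1` instead of csh's `>& file` - portable Bourne syntax
    sh -x $CMD > "$OUT" 2>&1
else
    echo "would run: sh -x $CMD > $OUT 2>&1"
fi
```

With the trace in one file, the `[: … expression syntax error` lines can be matched to the exact `test` expression in nhFetchDb.sh that produced them.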
-----Original Message----- From: David Scott [mailto:dscott@ipperformance.com.au] Sent: Thursday, April 26, 2001 7:03 AM To: Support Concord Cc: Peter T Bartlett Subject: FW: Call Ticket #43570 attention Russ Rick
Hello Support and Russ, I'm emailing to follow up on the status of this issue. The diagnostic collection commands you suggested to use may not have produced the output you required. Could you please advise the next course of action on the issue as it has been outstanding for some time and we would like to keep it moving towards a resolution. We await your next suggestion. Regards, David Scott.
5/10/2001 9:20:42 AM yzhang Russell, If they are on NT, try a command similar to the following, and send me the output file. I tested and this command worked: sh -x nhFetchDb.sh -p D:\nh5.0\db\remotePoller -rh sulfur -nr 3 > test.out Yulun
5/11/2001 10:08:22 AM rrick -----Original Message----- From: Shane Burdan [mailto:s_burdan@hotmail.com] Sent: Thursday, May 10, 2001 9:21 PM To: support@concord.com Subject: Ticket: 43570
Hi Russ, The customer reports the output file contained no data, so they did not send it; however they have sent the telnet output instead. Thank you, Kind Regards, Shane Burdan Customer Support Engineer Phone: +612 99650608 Fax: +612 99290411 email: support@ipperformance.com.au
-----Original Message----- From: Rick, Russell Sent: Friday, May 11, 2001 9:59 AM To: Zhang, Yulun Subject: FW: Ticket: 43570 FYI - This is the command you gave me yesterday for ticket #12382 Russell K.
Rick
5/14/2001 1:33:22 PM yzhang Run this from the central machine after logging in as nhuser and sourcing nethealthrc.csh: echo "delete from nh_elem_assoc where element_id < 1 and element_id > 13000000\g" | sql $NH_RDBMS_NAME
6/11/2001 5:44:33 PM yzhang Closed, the fix has been checked in for nh5.0, and the workaround has been proven for the syntax expression error
1/19/2001 7:26:46 PM cbjork This is the scenario: When running $NH_HOME/bin/nhSaveDb -u nhuser -p /some_directory_that_doesn't_exist", the job doesn't run due to the fact that the directory doesn't exist on the partition or path, and information indicating such a PATH DOESN'T EXIST error is not written to the Save.log The customer would like to see some kind of informational error indicating as such written into the log under this type of circumstance. customer: Southwestern Bell Internet 001495
7/31/2001 4:26:58 PM bhinkel Re-assigned to Joel - improvement.
9/1/2001 3:19:08 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.
1/30/2001 11:37:14 AM schapman The customer has experienced a number of database inconsistencies. Some have been associated with Database Fetches:
Job started by Scheduler at '2000/12/21 09:30:28'. nhFetchDb -p /var/nethealth/idb/remotepoller/nethealth -rh adelmo-nh
### Beginning Fetch Thu Dec 21 09:30:29 MET 2000
ENTRY> adelmo-nh neth /var/nethealth/idb/remotepoller nethealth
Connecting to host adelmo-nh Host adelmo-nh is alive FTP connection successful to host adelmo-nh Copying files from host adelmo-nh adelmo-nh::/var/nethealth/idb/remotepoller/Remote.tdb.12-21-2000_09.00.14 Done copying files from adelmo-nh Disconnecting from host adelmo-nh Disconnected from host adelmo-nh
### Beginning Merge Thu Dec 21 09:30:41 MET 2000
Deleting the following element ids from the central database: INGRES TERMINAL MONITOR Copyright c 1981, 1998 Computer Associates Intl, Inc.
### Beginning Merge Thu Dec 21 09:30:41 MET 2000 Deleting the following element ids from the central database: INGRES TERMINAL MONITOR Copyright c 1981, 1998 Computer Associates Intl, Inc. (0 rows) From 2000196 to 2018098. Removing element and analyzed data after 977363929 INGRES TERMINAL MONITOR Copyright c 1981, 1998 Computer Associates Intl, Inc. (15275 rows) (15275 rows) (0 rows) (8077 rows) (15275 rows) Checking for duplicate element names and inserting elements ... Error: Append to table nht_dst_element failed, see the Ingres error log file for more information (E_US125C Deadlock detected, your single or multi-query transaction has been aborted. (Thu Dec 21 03:34:31 2000) ). Adding remote element association, element alias and latency data ... Adding element data from files lat_b41/nea_b23/els_b45/mtf_b45 ... INGRES TERMINAL MONITOR Copyright c 1981, 1998 Computer Associates Intl, Inc. (0 rows) (8077 rows) E_US0026 Database is inconsistent. please contact the system manager (Thu Dec 21 03:35:02 2000) E_US0026 Database is inconsistent. please contact the system manager (Thu Dec 21 03:35:02 2000) Later on they experienced an inconsistency that appears to be related to a resource deadlock. There was a resource deadlock with the iirelation table ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM0042_DEADLOCK Resource deadlock. ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_QE002A_DEADLOCK Deadlock detected. ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 466 for table iirelation in database nethealth with mode 5. Resource held by session [16738 674]. ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM0042_DEADLOCK Resource deadlock. ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM960F_DMVE_REP Error recovering REPLACE operation. ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM967A_DMVE_UNDO An error occurred during UNDO recovery, LSN: <976880993,77329776>. 
ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM9639_DMVE_UNDO An error occurred during UNDO recovery.
ABBONE ::[48689 , 0000067a]: Thu Jan 25 04:08:09 2001 E_DM9509_DMXE_PASS_ABORT Transaction rollback has failed and has been passed to the DMFRCP (PASS ABORT, ID: 00003A3A3A65715E).
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 466 for table iirelation in database nethealth with mode 5. Resource held by session [16738 674].
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM0042_DEADLOCK Resource deadlock.
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM960D_DMVE_PUT Error recovering PUT operation.
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM9638_DMVE_REDO An error occurred during REDO recovery.
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM9439_APPLY_REDO Error applying REDO operation.
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM943D_RCP_DBREDO_ERROR Recovery error on Database nethealth. Error occurred applying Redo recovery for log record with LSN (976880993,77328632). Recovery will be halted on this database while the RCP attempts to successfully recover other open databases.
::[II_RCP , 00000005]: Thu Jan 25 04:08:12 2001 E_DM943B_RCP_DBINCONSISTENT Database (nethealth, neth) being marked inconsistent by the recovery process. The database could not be successfully restored following a system, process, or transaction failure. The database should be restored from a previous checkpoint.
The db was unable to move forward with the transaction. The db was unable to roll back the transaction. Unable to either complete or undo the transaction, the db put up a flag stating it was inconsistent.
Customer reinstalled Ingres 2.0 after the database became inconsistent. Then the nhLoadDb process was started, which took almost 12 hours to complete. Next, they tried nhResizeIngresLog 1000 for the Transaction Log file and the errlog.log file reported locking problems.
ABBONE ::[54137 , 0000000a]: Fri Jan 26 03:43:34 2001 E_SC0271_EVENT_THREAD The SCF alert subsystem event thread has been altered. The operation code is 0 (0 = REMOVE, 1 = ADD, 2 = MODIFY). ABBONE ::[54137 , 00000004]: Fri Jan 26 03:43:34 2001 E_SC0235_AVERAGE_ROWS On 2671. select/retrieve statements, the average row count returned was 1. ABBONE ::[54137 , 00000004]: Fri Jan 26 03:43:34 2001 E_SC0128_SERVER_DOWN Server Normal Shutdown. ABBONE ::[54137 , 00000004]: Fri Jan 26 03:43:34 2001 E_CL2518_CS_NORMAL_SHUTDOWN The Server has terminated normally. ABBONE ::[54137 , 00000001]: Fri Jan 26 03:43:35 2001 E_SC0235_AVERAGE_ROWS On 2671. select/retrieve statements, the average row count returned was 1. ABBONE ::[54137 , 00000001]: Fri Jan 26 03:43:35 2001 E_SC0128_SERVER_DOWN Server Normal Shutdown. ABBONE ::[54137 , 00000001]: Fri Jan 26 03:43:35 2001 E_CL2518_CS_NORMAL_SHUTDOWN The Server has terminated normally. ::[II_ACP , 00000001]: Fri Jan 26 03:43:35 2001 E_DM9815_ARCH_SHUTDOWN Archiver was told to shut down. ::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DMA475_LGALTER_BADPARAM A stale or inconsistent logging system handle was passed to LGalter(). The actual object type was 0, and the actual object re-use counter was 2, but the provided handle re-use counter was 1. The LGalter() function code was LG_A_CPFDONE. ::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_CL0F04_LG_BADPARAM Bad input parameter passed to LG routine ::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DM900B_BAD_LOG_ALTER Error altering the characteristics of the logging system, characteristics: 00000029. ::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DMA416_LGEND_BAD_XACT A stale or inconsistent logging system handle was passed to LGend(). The actual object type was 0, and the actual object re-use counter was 2, but the provided handle re-use counter was 1. 
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_CL0F04_LG_BADPARAM Bad input parameter passed to LG routine
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DM900E_BAD_LOG_END Error trying to end the transaction: 0001000A.
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_CL1037_LK_RELEASE_BAD_PARAM LKrelease() failed due to a lock id bad parameter; input lock id = 12; the system lock type = 0; the system id = 12.
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_CL1003_LK_BADPARAM Bad parameter(s) passed to routine
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DM901B_BAD_LOCK_RELEASE Error releasing the lock list: 0000000C.
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DMA41B_LGREM_BAD_DB A stale or inconsistent logging system handle was passed to LGremove(). The actual object type was 0, and the actual object re-use counter was 2, but the provided handle re-use counter was 1.
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_CL0F04_LG_BADPARAM Bad input parameter passed to LG routine
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DM9016_BAD_LOG_REMOVE Error removing the database: 0001008C from the logging system.
::[II_RCP , 0000000b]: Fri Jan 26 03:43:36 2001 E_DM0116_FAST_COMMIT An error occurred in the DMF Fast Commit procedure. Server can no longer support Fast Commit protocol.
abbone ::[54114 IIGCN, 00000000]: Fri Jan 26 03:43:42 2001 E_GC0152_GCN_SHUTDOWN Name Server normal shutdown.
abbone ::[54446 IIGCN, 00000000]: Fri Jan 26 03:44:47 2001 E_GC0151_GCN_STARTUP Name Server normal startup.
They were told to sysmod the DB to prevent these types of locking errors from occurring. First it was run once a week and then twice a week, but they are still seeing the deadlocks, and they would like to get an explanation of why this is occurring and what they can do to prevent it in the future.
All related logs and files are on voyager\tickets\44000\44831
1/31/2001 12:05:35 PM yzhang The ticket was created with the following text: This is a problem regarding a database inconsistency which may be caused by deadlock. We have some information and log files for you to look at, so we will need an upload directory. The Ingres version is OI 2.0/9712 (su4.us5/00) with patch 6641. Thanks Yulun Zhang
1/31/2001 1:26:13 PM schapman Forwarded the overflow output to Yulun
1/31/2001 5:13:03 PM yzhang Just to let you know that iitable.out and nhOverFolwTable.tar, obtained from our customer today, have been put on CA's FTP site. The errlog.log from the escalated directory is also put on CA's site. Issue number is: 10667841
2/1/2001 10:30:50 AM schapman Forwarded the latest nhCollectCustData to Robin
2/1/2001 1:24:12 PM rtrei Response to numerous messages from KPN. Sent to Don, Sheldon to pass on or not as needed. Information sent up to CA. Sheldon trying resetting shared memory.
Hello, This is in response to several questions and observations KPN has had with regard to netHealth's use of the Ingres database system. I have italicized the original comments and have put my answers below the most relevant areas. I appreciate KPN's concern and the time they have taken to raise these issues. Concord Communications is very concerned with delivering high quality products.
Hello to you all, I once more will try to explain what is happening on our NetHealth systems. As you already should know, a database can be accessed only by SQL. I don't know the source code for netHealth, but this means there are probably two options: 1) The NetHealth application opens a permanent SQL session (port) or 2) The NetHealth application opens lots of SQL sessions all the time. (Of course, this all happens in the background.)
So, let us review what happens when an (Ingres) SQL session has been opened: The SQL buffer is filled with SQL statements, for example:
SQL> create table nh_stats0_987654321........... ;
SQL> create index ............. ;
SQL> insert into nh_stats0_987654321............ ;
SQL> update nh_element..................;
etcetera. Each SQL statement ends with a semicolon; the ";" is the command separator. Where Oracle immediately executes those commands and places an entry in its rollback, Ingres does it in another way. Ingres waits for a 'GO' command ("\go" or "\g") before execution starts.
For all deletes, inserts and updates the database then puts a lock on the records involved. Ingres has a max_record_lock setting; when this is exceeded the table as a whole will be locked. Ingres also has a max_tables_lock setting, and even a max_transaction_lock setting. When locks are not being cleared during processing, Ingres will run out of some max_lock constraint after a period of time! When this happens, Ingres can't finish the processing and tries a 'rollback'. The rollback will fail also because there are no more locks available. Therefore Ingres tries a redo, and then a rollback, and a redo again, and so on. This process keeps on running 'infinitely', until it results in a deadlock situation at the end. It will also result in lots of overflow! And at the same time there is a chance that the Transaction Log will grow out of size! This is because all transactions are logged in the Transaction Log for rollback purposes.
So, the question is: How can we clear these locks and free up some space in the Transaction Log? Well, that's easy! Just give a 'commit' ("\commit") every now and then. (Note: Quitting a SQL session automatically results in a commit.) As soon as a commit has been passed to SQL, the changes become definite. After such a 'commitment' the locks and transaction logs are no longer needed.
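The \g-then-commit behavior described above can be illustrated with a tiny terminal-monitor session. A sketch only: the update statement and its column are hypothetical, and the guard keeps it from running anywhere the Ingres `sql` client is absent:

```shell
# commit_sketch.sh - illustrates the buffering/locking behavior above:
# Ingres' terminal monitor executes nothing until \g, and the row locks
# taken by the update are held until the commit (or session exit).
SESSION="update nh_element set name = name where element_id = 42\\g
commit\\g"

if command -v sql >/dev/null 2>&1; then
    echo "$SESSION" | sql nethealth
else
    # off-box: just show the terminal-monitor input
    printf 'terminal-monitor input:\n%s\n' "$SESSION"
fi
```

Until the `commit\g` is issued, any concurrent session touching the same nh_element row would block, which is exactly the contention Ed describes on the Central system.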
So, they are cleared / freed and become available again for other processes. So, the next question is: Why do we have more problems on the Central than on the Pollers? Well, most of the time the pollers are only inserting new records into new stats tables! Therefore lock problems are less likely. On the Central system, however, for example the 4 fetches are updating the same stats tables. So, in this case lock problems are far more likely! Possible causes: (1) There are not enough "\commit" statements in the background to clear locks and free up transaction log space. (2) The lock modes are not set right for SQL sessions ( SQL> 'set lockmode session .........' ). (3) There is a shared memory problem causing the lock problems (caused by DiskSuite, for example?). (4) Something else........ 1) We actually go to significant effort to be sure that the transactions are sized correctly to accomplish the smallest unit of work possible. However, the purpose of a transaction is to be sure that a whole segment of work is accomplished or not accomplished in its entirety, so adding additional 'commit' statements is sometimes not feasible. Also, wherever feasible we do carefully controlled bulk inserts, which do lock the entire table but which are significantly faster than a series of inserts. I have asked support to be sure your remote polling processes are not all inserting into the central database at the same time. 2) All netHealth processes should have locking set such that a read does not cause an inadvertent lock. It is possible that some process does not do this, so if you notice a particular process that seems to be causing the problem, please notify me of the process name immediately and I will check into it. It is possible for deadlocks to occur, as it is with any database system. However, we usually have fail-safes in the Nethealth code base that retry or attempt some other recovery mechanism.
So, it is possible to see Deadlock error messages in the Ingres errlog.log and not have had it impact the netHealth system in any way. I have given Concord's technical support guidelines on how to evaluate an errlog.log for this situation and when to be concerned and when not. The errlog.log is usually collected for any database-related problem and it is scanned for any problems, not just the problem the customer reports. The only time I have encountered a problem similar to what you have described was when we were testing a new Ingres patch from CA which we refused to accept. We are actively reviewing this new occurrence with CA. Note: On our least significant Poller I experimented a little with Ingres settings (config.dat) and SHM settings. This system then proved to be far more stable than the others..... Moreover, it had better performance. But I have to admit, changing Ingres and/or kernel settings probably will only delay the problem occurrences. Thanks and Best Regards, Ed Donath Oracle DBA Integrated Software Solutions Operational Support Unit Broadband Access and IP Networks KPN Telecom, The Netherlands
KPN-040001 The OpenIngres database very often becomes inconsistent! This is not only a fact for the NETHEALTH database, but for the IIDBDB system database as well!! As a result, the Network Health application does not work out of the box, because Network Health tools like nhDestroy, nhCreateDb or nhForceDb won't work when the system database itself becomes inconsistent. OpenIngres database inconsistencies are fairly easy to prevent. We only need to run the Ingres DBA utility sysmod every now and then! This utility can be found in the directory $NH_HOME/idb/ingres/bin. Possible improvement: Add the running of sysmod to the nhReset -db command or create, for example, an nhReset switch -dbs which takes care of this. And maybe it is not a bad idea to add a sysmod command to the /etc/init.d/nethealth.sh and/or the nhStartDb scripts as well.
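The sysmod suggestion above can be sketched as a small maintenance script. sysmod is the Ingres DBA utility named in the ticket ($NH_HOME/idb/ingres/bin); running it against the iidbdb system database as well is my assumption, not something the ticket prescribes. DRY_RUN=echo keeps this a dry run, since it needs a live Ingres install:

```shell
#!/bin/sh
# Dry-run sketch of the periodic sysmod maintenance suggested above.
# sysmod is the Ingres DBA utility in $NH_HOME/idb/ingres/bin; applying it
# to the iidbdb system database too is an assumption. With DRY_RUN=echo
# (the default here) the commands are printed, not executed.
DRY_RUN=${DRY_RUN:-echo}
run_maint() {
    $DRY_RUN sysmod nethealth   # remodify system catalogs of the product DB
    $DRY_RUN sysmod iidbdb      # assumed: same treatment for the system DB
}
run_maint
```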
Priority: A Submitted: January 2001 Concord-ID: N/A Status: N/A There is a logical which can be set that will cause nhReset to run sysmod. I have asked Concord's Technical Support to contact you about this. If the iidbdb database becomes inconsistent, please use the command "rollforwarddb iidbdb". The nhForceDb command is only for the nethealth database. Lastly, please review your shutdown procedures and make sure that the nethealth.sh stop command in /etc/init.d is being allowed to run. The major cause of inconsistent databases in Ingres stems from improper shutdowns.
KPN-040002 The Nethealth database produces a lot of deadlock errors, which results in poor performance and even database crashes! It is fairly easy to reproduce a deadlock situation, as I already showed in previous e-mails. Again, it is fairly easy to prevent all those deadlock occurrences! Adding a few commit statements within called SQL procedures will clear pending locks right away! Another way of doing this is the statement set lockmode session where readlock = nolock at the beginning of each SQL session. But a far more systemwide solution is of course changing the dbms.*.system_readlock: share setting to nolock in the OpenIngres configuration file $NH_HOME/idb/ingres/files/config.dat. Possible improvement: Review the locking modes for the Nethealth SQL procedures or change the locking modes - especially for readlocks - systemwide as mentioned above. Priority: A Submitted: January 2001 Concord-ID: N/A Status: N/A All Nethealth processes should have lock mode set such that a read does not cause a lock. Please notify me of any specific process that does not seem to have this set.
KPN-040003 The Ingres timezone is automatically set to NA-EASTERN by the Network Health Installation Program. This results in a 6-hour time difference between timestamps in SQL output and the actual local time.
As we are running customized queries on the Network Health servers as well, it should be possible to change the timezone. Note that you will find the Ingres timezone setting in $NH_HOME/idb/ingres/files/symbol.tbl. So the question now remains what impact a timezone change will have on the Network Health application. Possible improvement: Provide some information on Ingres timezone settings and how a change of the Ingres timezone affects the Network Health application. Priority: B Submitted: January 2001 Concord-ID: N/A Status: N/A All time information is stored as UTC within the database. Nethealth uses its own timezone logicals to report the time out correctly in its processes. The Ingres database uses another. There is a utility in $NH_HOME/bin called nhOITimeZoneUpdate which may be used to set it to a different time zone.
KPN-040004 OpenIngres version 2.0 should be capable of supporting multi-processor environments. Note the II_NUM_OF_PROCESSORS entry in $NH_HOME/idb/ingres/files/symbol.tbl. So this raises the question - also from the Performance Management point of view - how to implement multi-processor support for OpenIngres and Network Health. Possible improvement: Provide some information on the implementation of multi-processor support for OpenIngres and Network Health. Priority: B Submitted: January 2001 Concord-ID: N/A Status: N/A According to discussions I have had with CA, II_NUM_OF_PROCESSORS is used to make resource decisions on multi-processor systems. It is an easy parameter to set or reset at the customer site.
KPN-050001 The default OpenIngres parameters are by far not sufficient to support very dynamic or large Network Health environments! This often causes the Nethealth database to crash! Especially the rcp.lock.per_tx_limit - set by default to 700 - in the OpenIngres configuration file $NH_HOME/idb/ingres/files/config.dat caused a lot of database crashes during large polling discovery processes.
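For reference, the parameter called out above lives as a plain-text line in config.dat. A sketch of what to look for (the `ii.<hostname>.` prefix is my assumption about the file's layout; 700 is the default value quoted in the ticket):

```shell
# Sample of what the rcp.lock.per_tx_limit entry looks like in config.dat.
# The ii.<hostname>. prefix is an assumption about the file's layout; 700 is
# the default value quoted in the ticket. A sample file stands in for the
# real $NH_HOME/idb/ingres/files/config.dat.
cat > /tmp/config.dat.sample <<'EOF'
ii.myhost.rcp.lock.per_tx_limit:  700
EOF
grep 'rcp.lock.per_tx_limit' /tmp/config.dat.sample
```

Against a real install, the same grep on $NH_HOME/idb/ingres/files/config.dat shows the current limit.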
Improvement: Adjust a few important settings in the OpenIngres configuration file config.dat. This can be done quite easily with the Ingres CBF (Configuration-By-Forms) utility. Probably within a few days I will propose a few interesting settings. Priority: A Submitted: January 2001 Concord-ID: N/A Status: N/A We are understandably cautious about making changes in the config.dat file. Each parameter has been discussed with CA. And, as long as it passes our performance testing for the published sizes, we do not see the need to make changes. However, I look forward to seeing your proposal and will consider it carefully and with respect.
KPN-050002 The Network Health database design is responsible for the creation of lots of overflow chains! The overflow chains slow concurrent performance because they will increase I/O, cause concurrency problems, and use up locking system resources! Read all about it in the Ingres DBA Reference on pages 19-4 and 19-5. As we know, the Network Health Reports will mostly query the statistics tables, so let us have a look at the most recent complete stats table we can find:
* select table_name from nhv_stats_tables where max_range=(select max(max_range)-3600 from nhv_stats_tables)\g
Executing . . .
table_name: nh_stats0_979293599
(1 row) continue
* help table nh_stats0_979293599\g
Executing . . .
Name: nh_stats0_979293599
Owner: neth
Created: 12-jan-2001 04:04:34
Location: ii_database
Type: user table
Version: OI2.0
Page size: 2048
Cache priority: 0
Alter table version: 0
Alter table totwidth: 156
Row width: 156
Number of rows: 15081
Storage structure: heap
Compression: hidata
Duplicate Rows: allowed
Number of pages: 508
Overflow data pages: 505
Journaling: enabled after the next checkpoint
Base table for view: no
Optimizer statistics: none
Column Information:
Key Column Name Type Length Nulls Defaults Seq
sample_time integer 4 no no
element_id integer 4 no no
delta_time integer 4 no no
good_polls integer 2 no no
missed_polls integer 2 no no
bad_polls integer 2 no no
reboots integer 2 no no
total_time integer 4 no no
available_time integer 4 no no
reachable_time integer 4 no no
latency integer 4 no no
dll_frames float 4 no no
dll_bytes float 4 no no
dll_mcasts float 4 no no
dll_bcasts float 4 no no
dll_rcv_off_frames float 4 no no
dll_xmt_off_frames float 4 no no
dll_transits float 4 no no
dll_enet_frames float 4 no no
dll_collisions float 4 no no
dll_errors float 4 no no
dll_algn_errors float 4 no no
tr_set_recovery_mode float 4 no no
tr_signal_loss float 4 no no
tr_bit_streaming float 4 no no
tr_contention_streaming float 4 no no
tr_line float 4 no no
tr_burst float 4 no no
tr_internal float 4 no no
tr_abort float 4 no no
tr_address_copied float 4 no no
tr_congestion float 4 no no
tr_lost_frame float 4 no no
tr_token float 4 no no
tr_frequency float 4 no no
tr_frame_copied float 4 no no
tr_llc_frames float 4 no no
packets_in float 4 no no
bytes_in float 4 no no
packets_out float 4 no no
bytes_out float 4 no no
Secondary indexes:
Index Name Structure Keyed On
nh_stats0_979293599_ix1 btree sample_time, element_id
nh_stats0_979293599_ix2 btree element_id, sample_time
continue
* Notice the values for the Number of pages (508) and Overflow data pages (505). 505/508 is about 99%, which is far more than the 10-15% allowed.
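The 99% figure above is just overflow pages over total pages, with both numbers taken from the `help table` output:

```shell
# Overflow ratio from the `help table` output above: 505 overflow pages out
# of 508 total data pages. Anything well above the 10-15% mentioned in the
# ticket signals long overflow chains.
pages=508
overflow=505
ratio=$(awk -v p="$pages" -v o="$overflow" 'BEGIN { printf "%.0f", 100 * o / p }')
echo "overflow ratio: ${ratio}%"   # -> overflow ratio: 99%
```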
In this case it really indicates the usage of inefficient indexes. Usually indexes are added to boost the performance of queries. They always tend to slow down data manipulation. So, when we query our statistics tables we will lose far more performance to the overhead than we'll gain by making use of the indexes! And why do we need those 2 indexes anyhow; they are more or less the same! Possible improvement: Redesign the Nethealth database, using for example different storage structures. Use unique keys. Reorganize and/or tailor tables with the modify command. See the DBA reference on pages 19-4 and 19-5 and all related pages referred to. Priority: A Submitted: January 2001 Concord-ID: N/A Status: N/A Actually, the stats tables are stored as heap with separate btree indexes. All heap tables are stored this way. The btree indexes are fine because no new data is ever added to these tables once they are indexed. The discussion you pointed out was only dealing with btree tables. There is a logical which can be set which will cause a sysmod to be run for the system tables and which will also cause our nhiDbMaint program to be run to trim all of the Nethealth btree tables. I have asked Concord's Technical Support to discuss this with you.
2/9/2001 10:31:39 AM bhinkel The ball is really in Support's hands here, so this ticket has been changed to MoreInfo.
2/15/2001 9:42:42 AM schapman The customer has admitted that their system backups were the cause of the problems they were observing. I am closing the problem ticket.
2/2/2001 5:57:36 PM wburke Customer saw this problem 12/27/00; it was fixed by dropping the table and index from rollup Boundary. Customer saw the same problem (different table) again on 1/29/01. Customer is sending in the DB to determine why it fails consistently.
2/2/2001 6:00:00 PM wburke Problem: Begin processing (01/29/2001 08:15:49 AM).
Error: Append to table nh_dlg1b_980398799 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 22 rows not copied because duplicate key detected. Resolved: As per the DB troubleshooting guide: - Log into your Nethealth server as the nhUser. - Source the nethealthrc.csh file - Issue the following commands: echo "drop table nh_dlg0_980341199\g" | sql nethealth echo "delete from nh_rlp_boundary where max_range = 980341199 and rlp_stage_nmbr = 0 and rlp_type = 'BD'" | sql nethealth Next run $NH_HOME/bin/sys/nhiDialogRollup
2/5/2001 2:07:38 PM wburke db on voyagerii/44000/44959/db/44959.tdb
2/7/2001 1:59:21 PM yzhang Has to set the same environment as the customer, load the database, and do a debug to figure out if the dialog rollup failure is due to the rollup program or the conversation poller for this particular customer.
2/7/2001 3:54:58 PM yzhang Ticket closed because the dialog rollup only failed twice in two months, and the latest failure (the second) has been fixed.
4/19/2001 1:56:05 PM wburke OK. Dialog Rollups failed again. New db on voyagerii/48000/48441/dB/
5/3/2001 12:09:10 PM wburke Dialog Failure 48441
5/4/2001 9:24:02 AM yzhang Have the customer run this script after logging in as nhuser and sourcing nethealthrc.csh, then run the conversation rollup. I did the test on the customer's db, and the rollup worked after running the script. Thanks Yulun
5/7/2001 3:22:38 PM jpoblete Yulun, the Conversation Rollup has been fixed, but now the Traffic Accountant report fails. I'll be researching this and will let you know.
5/9/2001 3:38:34 PM jpoblete Yulun, the rollups are running, please close this one.
5/9/2001 3:39:02 PM jpoblete updated status to assigned.
5/24/2001 2:28:56 PM yzhang Will check with Robin to see if we consider this a bug
11/20/2001 11:07:32 AM yzhang I noticed 44959 has been closed, and I will close the associated prob. ticket.
11/20/2001 11:07:47 AM yzhang I noticed 44959 has been closed, and I will close the associated prob. ticket.
2/12/2001 2:04:36 PM wburke Unloading table nh_element . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_element () INTO '/opt/nethealth/idb/support.tdb/smt_b452'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Mon Feb 12 12:36:14 2001) ). (cdb/DuTable::saveTable)
2/12/2001 5:39:09 PM yzhang Can you check with the customer if they are doing remote polling; also tell them they should run the sysmod command frequently. Then have them run the following command: verifydb -mrun -sdbname "nethealth" -oDrop_table "nh_element_ix2" and send us the iivdb file. The problem now is that we have to find a way to access the nh_element table, otherwise they will lose the data
2/12/2001 5:49:13 PM wburke Requested info.
2/13/2001 10:18:29 AM yzhang This is good; at least we can access the table now. Can you get the following information so we can determine the next step: 1) are they running distributed poll 2) run nhDbStatus > nhDbStaus.out 3) echo "copy nh_element() into 'nh_element.dat'\g" | sql nethealth Just give me the answers for questions 1 and 2; have them keep nh_element.dat. Thanks
2/13/2001 12:23:46 PM yzhang This is really bad now. We are not sure if we can get data from the saved file. I will try the following: 1) create a temp table, then load the data into the temp table; if this doesn't work, I will create the table with a 4K page size. I will send you a script for this. Yulun
2/14/2001 12:09:24 PM wburke Script did not work: E_US07DA Duplicate object name 'nh_element_test'. (Tue Feb 13 16:17:09 2001) E_CO0005 COPY: can't open file 'smt_b450'. E_CO0022 COPY: Internal error initializing COPY. (0 rows) Need status.
2/14/2001 12:58:26 PM yzhang The nh_element table was recovered from the latest smt_b452. Now he is recovering the database. After that some cleanup needs to be done. Can you check to see if somebody knows how to set the bits-per-second variable.
Thanks Yulun
2/14/2001 4:32:07 PM yzhang After he finishes the deleting, do the following: 1) copy nh_element() from 'smt_b452.zip' (he needs to mv smt_b452.bak.zip to smt_b452.zip, and have the smt_b452.zip file located in the directory where he runs the copy statement) 2) after step 1, run the following: 1) select count (*) from nh_element where element_class = 1 \g 2) select count (*) from nh_element where element_class in (2,3)\g 3) output of nhDbStatus Send us the counts and the output of nhDbStatus. Have him keep the test_db database and the smt_b452.bak.zip file; he needs them if something goes wrong. Thanks Yulun
2/15/2001 10:43:34 AM jpoblete Customer could not delete the rows in the nh_element table; the command hung and didn't delete the rows, it ran overnight. Yulun is working on this issue.
2/17/2001 5:12:29 PM yzhang The customer has cleaned the node from the nh_element table; after he loads the table they will be all set.
2/20/2001 9:53:30 AM yzhang The problem was resolved, and the problem ticket was closed
02/13/2001 11:27:14 AM jpoblete Customer: Johnson & Johnson NH Version: 4.7.1 P01 Problem: Conversation Rollup hangs, the first rollup of the day. Disabled the Scheduled Conversation Rollup from the Network Health Scheduler and rescheduled the Conversation Rollup using cron to get an advanced logging trace. The conversation rollup worked for several days, but on Feb 13 it hung again; had the customer send me the advanced logging trace and a copy of the errlog.log. The advanced logging trace showed the last entry on Feb 12 at 02/12/01 14:47:09, however, there is no message in the errlog.log for Feb 12, but there are deadlock & QEF messages for Feb 13: NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_CL1004_LK_DEADLOCK Deadlock detected NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_DM9045_TABLE_DEADLOCK Deadlock encountered locking table nwhealth.nh_elem_analyze in database nethealth with mode 3. Resource held by session [2385 19c9].
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_DM0166_AGG_ADE_FAILED Execution of ADE control block in DMF Aggregate Processor failed.
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_AD2103_ALLOCATED_FCN_ERR The callback function for the iitotal_allocated_pages function returned an error; check the DBMS error log for more information
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_SC0216_QEF_ERROR Error returned by QEF.
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_CL1004_LK_DEADLOCK Deadlock detected
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_DM9045_TABLE_DEADLOCK Deadlock encountered locking table .TABLE_ID314 in database nethealth with mode 3. Resource held by session [2385 19c9].
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_DM0166_AGG_ADE_FAILED Execution of ADE control block in DMF Aggregate Processor failed.
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_AD2103_ALLOCATED_FCN_ERR The callback function for the iitotal_allocated_pages function returned an error; check the DBMS error log for more information
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_SC0216_QEF_ERROR Error returned by QEF.
NCS-NWMG::[33148 , 00001b03]: Tue Feb 13 07:47:15 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query.
Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log The advanced logging trace and the last copy of the Ingres errlog.log are available on: \\voyagerii\Escalated Tickets\44000\44531\nhiDialogRollup_dbg.txt \\voyagerii\Escalated Tickets\44000\44531\errlog3.log
4/5/2001 3:44:40 PM yzhang Logged an issue with CA regarding the deadlock
4/11/2001 4:21:35 PM jpoblete Yulun, I sent you the requested info, I'll wait for your indications.
4/12/2001 2:46:25 PM yzhang Attached are cleanNodes_pwc.sh and cleanNode_check.sh. Have the customer do a good db save, then run cleanNode_check.sh (this will produce cleanNodes_check.out), then run cleanNodes_pwc.sh (producing the concord.out file), and finally run cleanNode_check1.sh (producing cleanNodes_check_after.out). Send us all three files. All of the scripts have been tested.
4/13/2001 10:49:03 AM jpoblete Yulun, I had the customer run the scripts; I got his files, they are in the call ticket directory\Apr13
4/13/2001 11:57:32 AM yzhang This is a big cleanup; nodes and node address pairs were reduced significantly. I will send you another script to do more cleanup before they can run the conversation rollup.
4/13/2001 12:25:41 PM yzhang The customer has a good dbsave, right? Then have them turn the conversation rollup and poller off, run the attached script, and send me the concordClean.out file. They can run the conversation rollup if they run this script successfully.
4/19/2001 2:39:17 PM yzhang Have the customer do the following to check if all of the TA tables are in good condition. Email me all of the output files, as well as the errlog.log file.
echo " select element_class, count (*) from nh_element group by element_class\g" | sql nethealth > element_class.out
echo "help table nh_element\g" | sql nethealth > element.out
echo "help table nh_address \g" | sql nethealth > address.out
echo "help table nh_node_addr_pair \g" | sql nethealth > addr_pair.out
Thanks Yulun
4/19/2001 4:24:22 PM yzhang The output query files look perfect, but the latest information in the errlog is April 16; it does not have entries for yesterday and today reflecting the conversation rollup hanging. Please check the following: 1) make sure they can write to errlog.log (check the permissions; stopping and starting nhServer should place entries in the errlog.log.) 2) schedule an incremental conversation rollup every four hours; talk to Walter if you don't know how. 4) save the console messages to a file, then send it to me. 5) send me the rollup.log if they have one. 6) echo "help \g" | sql nethealth > help.out. Thanks Yulun
4/20/2001 5:23:06 PM jpoblete -----Original Message----- From: Poblete, Jose Sent: Friday, April 20, 2001 5:18 PM To: Zhang, Yulun Subject: 12692 Yulun, Here are the requested files. We killed the Conversation Rollup today; the DB did not go inconsistent, and the Rollup was in sleeping mode, not chewing any CPU.
5/3/2001 1:14:33 PM yzhang Can you collect the following info: echo "select table_name, create_date, num_rows from iitables order by table_name\g" | sql nethealth > iitable.out Thanks Yulun
5/4/2001 3:25:23 PM jpoblete Yulun, I have sent you the requested file.
5/10/2001 10:59:21 AM yzhang The output file looks fine
5/10/2001 11:29:23 AM jpoblete Yulun, As per the discussion we had yesterday, I have written a script to run the Conversation Rollups in debug with the option -now, specifying the date and time of when the rollup is supposed to stop processing. If we run nhiDialogRollup without any argument, the rollup will run forever: if we started the rollup at 11:30 AM and we look at the status of the process at 4:30 P.M.,
it looks like it's trying to roll up data collected at 4:20 PM. If we set the option -now and specify a date and time of when the rollup has to finish, the rollup will run OK, without further incident.
5/24/2001 2:33:56 PM yzhang Specify the date and time with the -now option when the customer's rollup hangs; will talk with Robin to see if this is really a bug.
7/13/2001 1:44:00 PM yzhang Walter, Can you check with the customer to see if they still have the problem; if so, collect nhCollectCustData and the output of the IPM. This is an important ticket and is about to be escalated Thanks
7/13/2001 1:59:24 PM wburke -----Original Message----- From: Burke, Walter Sent: Friday, July 13, 2001 1:47 PM To: 'NSINATO@NCSUS.JNJ.com' Subject: FW: prob. 12692/44531 Nick, The problem that you reported in call ticket #44531, Rollup Failure, has been assigned to an engineer. At this time, the issue has been evaluated. Have we seen continued behavior? If so, please run $NH_HOME/bin/nhCollectCustData. Sincerely,
7/17/2001 8:00:59 PM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, July 17, 2001 7:49 PM To: 'NSINATO@NCSUS.JNJ.com' Subject: Ticket # 44531 - Rollup Failure Nick, The problem that you reported in call ticket #44531, Rollup Failure, has been assigned to an engineer. At this time, the issue has been evaluated. Have we seen continued behavior? If so, please run $NH_HOME/bin/nhCollectCustData.
Sincerely,
7/18/2001 5:04:34 PM yzhang Robin and Jay, These two problems look the same: the conversation rollup hangs, but the rollup will run OK if the customer runs it with the -now option (specifying the date and time it should roll up to) after resetting nh_run_step and nh_run_schedule and setting the following three environment variables: NH_DBG_OUTPATH; NH_DBG_OUTPATH="$NH_HOME/tmp" NH_UNREF_NODE_LIMIT; NH_UNREF_NODE_LIMIT=3 NH_POLL_DLG_BPM; NH_POLL_DLG_BPM=2500 Jose has placed all of these into the following script; now both customers have run the script, and they are up and running. The script covers most of the workarounds we normally recommend to customers for conversation rollup problems. My question is: are we going to incorporate some of the workarounds into our conversation rollup source code, or just polish the attached script so that customers can run it when encountering the problem. Thanks Yulun
7/23/2001 2:19:14 PM jpoblete Yulun...I will move this to WIP
11/20/2001 11:24:14 AM yzhang Jose, This is a conversation rollup problem created a long time ago; I know the customer is up and running. I noticed in the last few months we have had fewer tickets on conversation rollup. My recommendation is to close this one. If the same problem happens for another customer, we can work through the more complete solution. Thanks Yulun.
11/20/2001 12:09:59 PM yzhang This is a conversation rollup problem created a long time ago; I know the customer is up and running. I noticed in the last few months we have had fewer tickets on conversation rollup. My recommendation is to close this one. If the same problem happens for another customer, we can work through the more complete solution. Thanks Yulun.
11/26/2001 8:06:34 AM ebetsold There is more than one call ticket associated with this bug. I have Henkel, which is a customer of ICS that has the same problem. Please reopen and work on this bug.
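The workaround Yulun describes can be sketched as a script. The three environment variables and their values are quoted from the ticket; the -now argument format is an assumption (the ticket only says it takes a date and time), so the rollup invocation is printed rather than executed:

```shell
#!/bin/sh
# Sketch of the conversation-rollup workaround described in this ticket.
# The three variables and their values come from the ticket; the -now
# argument format is an assumption. DRY_RUN=echo (the default here) prints
# the rollup command instead of running it.
NH_HOME=${NH_HOME:-/opt/nethealth}
NH_DBG_OUTPATH="$NH_HOME/tmp";  export NH_DBG_OUTPATH
NH_UNREF_NODE_LIMIT=3;          export NH_UNREF_NODE_LIMIT
NH_POLL_DLG_BPM=2500;           export NH_POLL_DLG_BPM
DRY_RUN=${DRY_RUN:-echo}
$DRY_RUN "$NH_HOME/bin/sys/nhiDialogRollup" -now "07/18/2001 17:00:00"
```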
11/26/2001 6:29:47 PM yzhang Can you both check with the customer to see if they still have the original problem.
11/30/2001 4:22:18 PM yzhang Two associated call tickets have been closed, thus the problem has been solved
12/10/2001 9:09:51 AM ebetsold ICS's original problem is still occurring. They are presently running ver 4.8
12/10/2001 5:49:37 PM yzhang Eric, I noticed you reopened this one, seeing ICS still has the problem. Can you give me a more detailed description of what problem they have. Thanks Yulun
12/12/2001 9:57:56 AM ebetsold A Scheduled Conversation Rollup takes more than 2 days to run and then finishes normally (reproducible). This results in other jobs being delayed, e.g. the server goes down on Tuesday noon for maintenance. Only the Sunday Rollup at 1am is affected; all others are finishing fast. This has migrated to a 4.8 upgrade. Do you require more information? If so, what?
1/16/2002 9:20:51 AM dbrooks Closed per critical bug meeting. Will reopen if customer requests.
1/24/2002 9:19:04 AM ebetsold Customer requests it stay open. Why was it closed?
2/26/2002 5:03:04 PM yzhang Send nhCollect.tar through running nhCollectCustData. The point is why only the Sunday dialog rollup is affected; possible reasons are: 1) the maintenance job hangs 2) do you have the environment variable NH_RESET_INGRES set? 3) how are the other jobs scheduled for Sunday afternoon. Do these jobs also hang?
3/4/2002 11:31:34 AM ebetsold Below please find the answers to your questions: > 1) does the customer's maintenance job hang? no, this job completed. > 2) does the customer have the environment variable NH_RESET_INGRES set? no, this variable is undefined. > 3) how are the other jobs scheduled for Sunday afternoon. Do these jobs also hang? no. all jobs ran fine.
Please find a current nhCollectCustData file on server BAFS Escalated Tickets 51000 51989
3/4/2002 12:05:19 PM yzhang Based on the information you posted, the customer should do the following: 1) their free space is below 1G; they need to add at least 2G of disk space 2) they keep too much stats0 data; see if they can reset the length of raw data they want to keep. 3) what transaction log size do they currently have. Yulun
4/4/2002 4:02:55 PM schapman Call ticket closed due to non-response. Closing problem ticket
2/13/2001 12:58:33 PM rrick Hi Dave, The Nethealth Database Troubleshooting Documentation says to speak to you about acquiring the following new module for any customers on v4.5 if they are getting the following message: Begin processing (01/31/2001 15:19:21). Error: Append to table nh_dlg1s_979707599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 8 rows not copied because duplicate key detected. ). A new build of the cdb library and nhiDialogRollup, from the build view correct for the customer's platform (fx_hp, fx_sol, fx_NT). Can you please help? Thanks again, Russell K. Rick
3/1/2001 8:19:45 PM rtrei Yulun-- See if Rick did get this from Dave A. If it solved the problem, close the ticket. Otherwise, we will need to decide what we want to do about a 4.5 system.
4/3/2001 12:04:52 PM yzhang Russell, Can you check with the customer to see if they still have the problem. Thanks Yulun
4/3/2001 1:21:14 PM rrick Friday, March 02, 2001 4:58:14 PM mmcnally -----Original Message----- From: Raymond_Yiu@vanguard.com [mailto:Raymond_Yiu@vanguard.com] Sent: Friday, March 02, 2001 3:34 PM To: McNally, Mike; Yi_Jia_Zhang@vanguard.com; Lauren_Dygowski@vanguard.com; Anand_Sampath@vanguard.com Subject: Re: 45128 - Conversations rollup failures. Hi, Mike! The conversation rollup finally completed today after 3 full days. Whew! Please go ahead and close this ticket. Thank you, Russ Rick, and Bob Keville for your help on this ticket!
Ray
2/16/2001 5:48:56 PM dmcauliffe If you create or modify a scheduled Database Save and enter leading space(s) for the directory name, then no directory is created. For example, I ran a Database Save and stored it in D:/nethealth/db/save/Testing. Everything ran fine; I checked the log file and it created the correct directory: D:/nethealth/db/save/Testing.tdb I then created a scheduled job for the Database Save and stored it in directory D:/nethealth/db/save/ Testing_again (that's " Testing_again"). I was able to hit the Scheduled button and save it with no problems. I then attempted to modify the scheduled job, and the directory saved was: D:/nethealth/db/save/ No .tdb directory was shown. Code should be modified to either display an error message that there's a problem with the directory name, or automatically suppress leading spaces before storing the new name. Also, if the directory was entered with enclosed spaces (using more than one name - ex: "test area"), the directory created would be "test.tdb". Found this because the customer had a scheduled Database Save that ran nightly. When he searched /nethealth/db/save, all he could find were zip files (couldn't find any .tdb directories). In checking further, found that the scheduled job referenced no .tdb directory. Had him modify his scheduled job to reference a new directory name with no extension and no leading spaces.
3/1/2001 2:52:39 PM rtrei I recommend that this be fixed for 5.5; 2 days to fix.
7/24/2001 4:58:30 PM rlindberg Re-assign to Karla. A simple check for spaces should be added. There is code to create a validator for the field that you can add. See me and I'll show you the example.
7/27/2001 3:17:06 PM keusebio This has been fixed.
7/27/2001 3:18:29 PM keusebio I am changing the status to fixed
2/21/2001 11:48:30 AM jnormandin Problem: Customer is running a multi-platform distributed polling environment.
At one point they conducted a fetch on a cross platform db save without utilizing the -ascii flag. This caused the "latest rollup" time stamp to be at a UTC value of 2036. Since there are no rollups occurring after 2036, this time stamp never changes. Although the customer has since reloaded ingres, the incorrect time stamp still remains. All other data and polling seem correct. Customer is looking to have the time stamp corrected. We should be able to modify the table where this is stored with SQL and as long as it is changed to something earlier than the next rollup date, the issue should self correct at the next rollup time. 2/26/2001 11:10:29 AM jnormandin Is there any update to this issue ? Robin Trei stated that it should be an easy fix and would just require some simple SQL commands. 2/27/2001 4:19:34 PM jnormandin From: Normandin, Jason Sent: Tuesday, February 27, 2001 4:24 PM To: Lemmon, Jim Subject: problem ticket # 12771 Jim. I notice that problem ticket 12771 is still designated in 'new' status. Has there been an update to this as of yet ? The bug was logged on Feb 21. Thanks Jason 3/1/2001 2:55:51 PM rtrei Phil-- Can you try to get an sql script to the customer site within the next 2 weeks? If not, let me know, but I figure it's a learning opportunity :> Come see me and/or Yulun for advice. 3/7/2001 12:24:25 PM jnormandin Any update ? 3/27/2001 2:31:23 PM jnormandin From: Normandin, Jason Sent: Tuesday, March 27, 2001 2:36 PM To: Adams, Phillip Subject: problem ticket 12771 Phil. Could you please update me regarding this open problem ticket ? Has a possible solution been determined for this issue? Cheers Jason 4/17/2001 10:00:18 AM jnormandin From: Normandin, Jason Sent: Tuesday, April 17, 2001 10:02 AM To: Adams, Phillip Subject: problem ticket # 12771 Importance: High Phil. Could you please update me in regards to the status of this problem ticket which was opened on 2/21. I have yet to receive word on a possible fix for this problem. 
I would like to update the customer asap in regards to this. Cheers Jason 4/19/2001 12:19:15 PM jnormandin From: Normandin, Jason Sent: Thursday, April 19, 2001 12:20 PM To: Trei, Robin Cc: Gray, Don; Adams, Phillip Subject: Problem ticket 12771 UTC 2031 time stamp shown in DB status last roll up window Importance: High Robin. I have yet to receive a fix or a reply from Philip Adams regarding this problem ticket. The customer is getting a little agitated and would like a fix asap. Could you please find out from Philip where he stands on this issue and if he or you could provide me with the necessary info to correct the date stamp. Thanks Jason 4/20/2001 1:49:50 PM yzhang As described, the "latest rollup" time stamp is at a UTC value of 2036; where is the UTC time being shown? 4/20/2001 2:21:36 PM jnormandin From: Normandin, Jason Sent: Friday, April 20, 2001 2:21 PM To: Zhang, Yulun Subject: RE: problem ticket 12771 DB status date issue Yulun, The UTC time stamp is actually not being shown... The date being shown is 2037, which I equated to the 999999999 utc time stamp maximum. The 2037 is showing up in the 'latest entry' in the database>status UI. Jason 4/20/2001 3:11:18 PM yzhang Can you help get this information: 1) echo " select * from nhv_stats_tables\g" | sql nethealth > stats_table.out 2) echo " select * from nhv_rlp_tables\g" | sql nethealth > rlp_table.out 3) echo " select * from nh_rlp_boundary\g" | sql nethealth > boundary.out Thanks Yulun 4/20/2001 4:02:27 PM jnormandin From: Normandin, Jason Sent: Friday, April 20, 2001 4:03 PM To: Zhang, Yulun Cc: Trei, Robin; Gray, Don Subject: RE: problem ticket 12771 DB status date issue Yulun Under normal circumstances I can appreciate your desire to be thorough, but due to the customer sensitivity issue (ticket was not addressed due to personnel changes) couldn't we just get the SQL commands necessary to change the Latest Entry for rollups to any date earlier than the current? 
Seeing as the time stamp is that of 2037, the latest rollup can never be greater than this, thus the time is never adjusted. Wouldn't simply changing this to any date earlier than the last rollup (e.g. Jan 1, 2000) solve this problem as the next rollup would correct this to the appropriate value? I appreciate your understanding in regards to this sensitive matter. Cheers Jason 4/26/2001 9:59:01 AM jnormandin - Files forwarded to Yulun, and placed on BAFS, 43027 4/26/2001 5:44:26 PM jnormandin From: Normandin, Jason Sent: Thursday, April 26, 2001 5:46 PM To: Zhang, Yulun Cc: Gray, Don; Ciavarro, Mike; DaSilva, Al Subject: problem ticket 12771 DB status incorrect date Yulun. This email is in follow up to the voice mail I have just left on your machine (5:40 pm, Thursday April 26th). I forwarded the requested sql output this morning and have yet to receive word back as of yet. I realize this is not an escalated bug but Robin Trei has conveyed that this should be a relatively simple fix. This issue is now reaching a point where customer sensitivity is involved so we need to move rapidly to get this issue resolved. I sincerely appreciate your help in addressing this issue. Thanks Jason 4/27/2001 11:36:08 AM yzhang Have customer run the following query, then run nhiDbStatus echo "update nh_rlp_boundary set max_sample_time = 973061999 where max_sample_time = 2143092537\g" | sql $NH_RDBMS_NAME Yulun 5/1/2001 8:46:10 AM jnormandin - SQL statement provided resolved issue 2/22/2001 2:00:21 PM cbjork At one point, it appeared that the statistics and conversations rollups were failing, but upon review of the current logs, they seem to be completing. Receiving multiple Dr. Watson errors: nhiDialogRollup.exe Exception: stack overflow (0x00000fd), Address: 0x704f03a7 2/22/2001 3:48:26 PM cestep Same as bug #11367 Waiting for binary from Yulun Zhang. 
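Returning to the rollup-boundary fix a few entries above: the two epoch values in Yulun's UPDATE statement can be sanity-checked with GNU date, which shows why the 'latest entry' displayed a year of 2037 and why the replacement value is safely in the past. The epoch values are taken from the ticket; the `date -u -d @SECONDS` form is GNU coreutils syntax, assumed available:

```shell
# The bogus "latest rollup" boundary and its replacement, from the UPDATE above.
bad=2143092537    # what the fetch without -ascii left behind
good=973061999    # an ordinary boundary earlier than the next rollup

date -u -d "@$bad"  '+%Y-%m-%d %H:%M:%S UTC'   # a date late in 2037
date -u -d "@$good" '+%Y-%m-%d %H:%M:%S UTC'   # a date late in 2000
```

Since rollups only ever advance the boundary, a stamp in 2037 can never be superseded; once it is reset to a value in 2000, the next rollup corrects it, which matches the resolution note above.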
2/23/2001 7:34:30 AM tstachowicz Per the user manual 4.7.1: page 15-14 "Once you have specified your checkpoint save location (using nhMvCkpLocation), you should not change it." and page 15-15 "If you save your database to a different location each day (to create a backup in case of catastrophic failure), you should not use the nhMvCkpLocation command to reset the location. You should save to the same location every time; then copy and tar all of the information to the new locations." User would like to see an error when trying to re-locate the checkpoint save location. Currently it allows the user to change it. 9/1/2001 3:19:11 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request. 2/23/2001 2:20:58 PM jpoblete Customer's scheduled conversation rollups failed silently, no error message in the job log. Had the customer run the conversation rollup by hand, got a segmentation fault, but could not locate a core file. Ran the conversation rollup with options -Dall and got a debug file near 230 Mb, also got a core file from yesterday. Additional information is available from the customer upon request. 2/28/2001 5:32:31 PM yzhang Having customer do the following: 1) check to make sure the disk controller and disks are good, because there are several errors in errlog.log indicating something wrong with the disks. 2) check the transaction log; if they have more than 200,000 nodes or elements, increase the transaction log to 1500 or 2000. 3) If these two do not solve the problem, unlimit the stack size (db worksheet page 10). 
Thanks Yulun 3/8/2001 4:10:20 PM jpoblete Called customer, the rollups are not failing, but are taking quite some time to finish; customer has already opened another ticket on that subject. Please close this issue. 4/19/2001 1:45:36 PM pkuehne See previous note. 2/26/2001 10:12:28 AM foconnor Customer was having their conversation rollups fail and as per the DB Troubleshooting guide tables were dropped and a new nhDialogRollup executable was sent, and they are still experiencing conversation rollup failures. Spoke to Yulun and Yulun suggested that I bug this issue and he will have a look. Original failure: $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (02/21/2001 04:30:30 AM). Error: Append to table nh_dlg1s_981003599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 95 rows not copied because duplicate key detected. ). ----- This information was collected: echo "help\g" | sql nethealth > tables.out echo "select table_name, create_date from iitables\g" | sql nethealth > concord.out echo "select table_name, num_rows from iitables\g" | sql nethealth >> concord.out Tables were dropped as per the DB Troubleshooting guide: Please run the following commands: 1. echo "drop table nh_dlg0_981003599;\g" | sql nethealth > drop.out 2. echo "delete from nh_rlp_boundary where max_range = 981003599 and rlp_stage_nmbr = 0 and (rlp_type = 'BD' or rlp_type = 'SD');\g" | sql nethealth >> drop.out 3. echo "drop table nh_dlg0_980989199;\g" | sql nethealth >> drop.out 4. echo "delete from nh_rlp_boundary where max_range = 980989199 and rlp_stage_nmbr = 0 and (rlp_type = 'BD' or rlp_type = 'SD');\g" | sql nethealth >> drop.out 5. echo "drop table nh_dlg0_980974799;\g" | sql nethealth >> drop.out 6. 
echo "delete from nh_rlp_boundary where max_range = 980974799 and rlp_stage_nmbr = 0 and (rlp_type = 'BD' or rlp_type = 'SD');\g" | sql nethealth >> drop.out Send drop.out Let nhiDialogRollup occur New nhDialogRollup.exe was sent to customer and installed and the rollups are still failing. New Conversation Rollups failing: Begin processing (02/22/2001 04:30:41 AM). Error: Append to table nh_dlg1b_981003599 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 14 rows not copied because duplicate key detected.). ----- Scheduled Job ended at '02/22/2001 04:30:49 AM'. ----- Thursday, February 22, 2001 2:02:59 PM wburke Begin processing (02/22/2001 01:52:57 PM). Table nh_rlp_boundary inconsistent, deleting row: type: BD stage: 1 max_range: 981003599. Error: Append to table nh_dlg1b_981089999 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 13 rows not copied because duplicate key detected. ). netmon% Files: /voyagerii/tickets/45000/45350 2/26/2001 10:15:05 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Monday, February 26, 2001 10:10 AM To: Zhang, Yulun Cc: Trei, Robin; O'Connor, Farrell Subject: Problem ticket 12826 Yulun, As per our conversation I have submitted Problem ticket 12826 for Call ticket 45350 regarding Conversation Rollup failures. 3/1/2001 12:33:41 PM yzhang requested information 3/1/2001 1:04:26 PM yzhang please get the following: echo "select table_name, create_date, num_rows from iitables where table_name like 'nh_dlg%'\g" | sql nethealth > tabletimes.out We've had a few cases where tables were created long after they should have been, so this will help with that. All the logs in $II_SYSTEM/ingres/files/*.log Especially make sure you get the errlog.log. Also the Nethealth system messages; save the messages using the NH console. 
All the files in $NH_HOME/logs (just tar up the entire directory; that's easiest) Thanks Yulun 3/1/2001 3:59:24 PM foconnor files have been received: //voyagerii/tickets/45000/45350/3-01-2001 3/3/2001 4:04:34 PM yzhang have customer drop table nh_dlg0_981017999, then run the following query: delete from nh_rlp_boundary where max_range = 981017999 and rlp_stage_nmbr = 0 and rlp_type = 'BD' Yulun 3/6/2001 2:34:48 PM foconnor Sent customer drop_dlg.sh script to perform the above. 3/7/2001 4:59:37 PM yzhang Have them run the following to drop one more table, then run the conversation rollup. If there is still a problem after this, obtain their database and let me know. Thanks Yulun echo "drop table nh_dlg0_981089999; commit;\g" | sql nethealth >> drop2.out echo "delete from nh_rlp_boundary where max_range = 981089999 and rlp_stage_nmbr = 0 and rlp_type = 'BD'; commit;\g" | sql nethealth >> drop.out 3/8/2001 10:45:12 AM yzhang Your script has an error, and the execution failed. Have the customer run the attached script just by typing the script name, then run the conversation rollup. Send me the concord.out (output of executing the script); don't drop the nh_dlg1b_981089999 table. Please don't modify my script, it has been tested. Let me know the result. Thanks Yulun 4/19/2001 2:33:44 PM yzhang This ticket has been in MoreInfo for a long time. Can you check with the customer to see if their problem has been solved? Thanks Yulun 4/19/2001 2:40:25 PM yzhang problem solved 2/26/2001 10:48:06 AM dblodgett NH Server is stopping during the weekend; customer is losing data. Customer says that this has occurred before and he has rebooted the server to resolve the issue. note: customer is running NH 4.8 (installed on February 5th) what is causing this? 
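Before the next ticket's logs, an aside on the conversation-rollup cleanup above: each stuck boundary needs the same drop-table / boundary-delete pair, so the commands can be generated rather than typed by hand (several of the originals lost their `| sql` pipe and misspelled `rlp_type` in transcription). A dry-run sketch that only prints the SQL; piping each statement through `sql nethealth` as in the ticket would actually execute it:

```shell
# Print the cleanup SQL for each stuck stage-0 dialog table, one
# drop/delete pair per boundary timestamp from the ticket. Nothing is
# executed here; the \g terminator is what the Ingres sql client expects.
for ts in 981003599 980989199 980974799; do
  echo "drop table nh_dlg0_${ts};\g"
  echo "delete from nh_rlp_boundary where max_range = ${ts} and rlp_stage_nmbr = 0 and (rlp_type = 'BD' or rlp_type = 'SD');\g"
done
```

Generating the pairs this way keeps the table name and the nh_rlp_boundary max_range in step, which is exactly the mismatch the manual transcriptions kept introducing.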
customer sent the errlog.log and NHsystem.log (located on voyagerii/46156) possible relevant cut from the errlog.log /////////////////////////////////////////////////////////////// OAKNCCNT::[II\INGRES\19f , 00000262]: Sun Feb 25 01:49:59 2001 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association OAKNCCNT::[II\INGRES\19f , 000000e1]: Sun Feb 25 02:25:02 2001 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association ... >>>CS_SCB found at 012B5040<<< cs_next: 012C1E40 cs_prev: 012A6B80 cs_length: 15108. cs_type: FFFFABCD cs_self: 0000021A (538.) cs_stk_size: 00000000 (0.) cs_state: CS_COMPUTABLE (00000001) cs_mask: (00000000) cs_mode: CS_INPUT(00000002) cs_nmode: CS_OUTPUT(00000003) cs_thread_type: CS_NORMAL(00000000) cs_username: ingres cs_sem_count: 00000000 ----------------------------------- Stack trace beginning at 7025fac3 Stack dmp name II\INGRES\19f pid 415 session 21a: 7025b1f7: (OIDMFNT,Base:701e0000)7025f480( 00e617b8 00000000 00000000 000000 6a 001fbee3 42dad7a4 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 70257535: (OIDMFNT,Base:701e0000)7025aa35( 010d7e60 0000002a 0000013d 000000 6a 00000000 001fbee3 014ad0e0 42dad550 00000200 42dad7a4 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 70256a77: (OIDMFNT,Base:701e0000)702571b2( 010d7e60 0000002a 00000000 000002 00 0000006a 001fbee3 014ad0e0 00000000 701d1024 014ad0ec 42dad7a4 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 702bbb4a: (OIDMFNT,Base:701e0000)702566de( 014ad0c0 0000002a 00000200 014ad0 ec 42dad7a4 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 702b7bb9: (OIDMFNT,Base:701e0000)702bb9e6( 014ad0c0 0132830c 42dad6f0 42dad6 f8 42dad72c 42dad7a4 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 702d754e: (OIDMFNT,Base:701e0000)702b7b39( 014ad0c0 01328598 0132830c 000000 01 42dad7a4 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 7022ad79: (OIDMFNT,Base:701e0000)702d73b9( 014ad0c0 01328598 00000100 013283 0c 42dad7a4 ) Stack dmp 
name II\INGRES\19f pid 415 session 21a: 701f78ac: (OIDMFNT,Base:701e0000)7022aaf0( 01328568 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 7078756a: (OIDMFNT,Base:701e0000)701f7740( 0000001f 01328568 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 707a5aa9: ????????( 0123ec98 012be7a0 00000000 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 707af74f: (OIQEFNT,Base:70760000)707a5950( 0123ebdc 012be7a0 00000001 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 7077271f: (OIQEFNT,Base:70760000)707ae782( 012be7a0 00000010 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 708f314f: (OIQEFNT,Base:70760000)70771460( 00000010 012be7a0 ) Stack dmp name II\INGRES\19f pid 415 session 21a: 70146ed9: ????????( 00000002 012b5040 012b5080 ) Stack dmp name II\INGRES\19f pid 415 session 21a: lstrcmpiW: ????????( ) 0000021a General Protection Exception @7025fac3 SP:42dad054 BP:42dad228 AX:0 CX:3a36860a DX:2f793e8 BX:12b5040 SI:12b9ae0 DI:12bf35c 0000021a Sun Feb 25 03:00:28 2001 E_DM9049_UNKNOWN_EXCEPTION An Unexpected Exception occurred in the DMF Facility, exception number 68197. 
OAKNCCNT::[II\INGRES\19f , 0000021a]: An error occurred in the following session: OAKNCCNT::[II\INGRES\19f , 0000021a]: >>>>>Session 0000021A<<<<< OAKNCCNT::[II\INGRES\19f , 0000021a]: DB Name: nethealth (Owned by: nethealth ) OAKNCCNT::[II\INGRES\19f , 0000021a]: User: nethealth (ingres ) OAKNCCNT::[II\INGRES\19f , 0000021a]: User Name at Session Startup: nethealth OAKNCCNT::[II\INGRES\19f , 0000021a]: Terminal: console OAKNCCNT::[II\INGRES\19f , 0000021a]: Group Id: OAKNCCNT::[II\INGRES\19f , 0000021a]: Role Id: OAKNCCNT::[II\INGRES\19f , 0000021a]: Application Code: 00000000 Current Facility: QEF (00000006) OAKNCCNT::[II\INGRES\19f , 0000021a]: Client user: ingres OAKNCCNT::[II\INGRES\19f , 0000021a]: Client host: OAKNCCNT12 OAKNCCNT::[II\INGRES\19f , 0000021a]: Client tty: OAKNCCNT12 OAKNCCNT::[II\INGRES\19f , 0000021a]: Client pid: 500 OAKNCCNT::[II\INGRES\19f , 0000021a]: Client connection target: nethealth OAKNCCNT::[II\INGRES\19f , 0000021a]: Client information: user='ingres',host='OAKNCCNT12',tty='OAKNCCNT12', pid=500,conn='nethealth' OAKNCCNT::[II\INGRES\19f , 0000021a]: Description: OAKNCCNT::[II\INGRES\19f , 0000021a]: Query: select 1 from iitables where table_name= ~V and table_type= ~V 0000021a Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. 0000021a Sun Feb 25 03:00:28 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. 0000021a Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. 0000021a Sun Feb 25 03:00:28 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. 0000021a Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. 0000021a Sun Feb 25 03:00:28 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. 0000021a Sun Feb 25 03:00:28 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. 
Check the server error log. 000001e0 Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. 000001e0 Sun Feb 25 03:00:28 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_SC0122_DB_CLOSE Error closing database. Name: nethealth Owner: nethealth OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: D:\nethealth\database\ingres\data\default\nethealth Flags: 00000003 OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. OAKNCCNT::[II\INGRES\19f , 000001e0]: Sun Feb 25 03:00:28 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. 00000187 Sun Feb 25 03:02:41 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. 00000187 Sun Feb 25 03:02:41 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:41 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:41 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:41 2001 E_SC0122_DB_CLOSE Error closing database. 
Name: nethealth Owner: nethealth OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:41 2001 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: D:\nethealth\database\ingres\data\default\nethealth Flags: 00000003 OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:41 2001 E_SC0221_SERVER_ERROR_MAX Error count for server has been exceeded. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:42 2001 E_PS0501_SESSION_OPEN There were open sessions when trying to shut down the parser facility. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:42 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:42 2001 E_SC0235_AVERAGE_ROWS On 13365. select/retrieve statements, the average row count returned was 1. OAKNCCNT::[II\INGRES\19f , 00000187]: Sun Feb 25 03:02:42 2001 E_SC0127_SERVER_TERMINATE Error terminating Server. OAKNCCNT::[ , 00000000]: Sun Feb 25 04:44:00 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Sun Feb 25 04:44:08 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Sun Feb 25 04:44:09 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Mon Feb 26 07:22:26 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Mon Feb 26 07:22:37 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Mon Feb 26 07:22:38 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. 
OAKNCCNT::[ , 00000000]: Mon Feb 26 07:23:57 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Mon Feb 26 07:23:59 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Mon Feb 26 07:24:09 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. OAKNCCNT::[ , 00000000]: Mon Feb 26 07:26:51 2001 E_GC0151_GCN_STARTUP Name Server normal startup. ///////////////////////////////////////////////////////////////////////////// 2/26/2001 10:54:37 AM rlindberg change the SW rev to R-4.8.0 2/27/2001 10:02:54 AM cestep Customer called back, his server was down this morning. Checked errlog.log, saw stack dump again last night. Server stopped a few minutes after, no log entries for the server, no polling. 2/27/2001 3:54:33 PM yzhang The problem is that the customer's nhServer on NT was stopping during the weekend. I checked the script, and it looks to be running fine here. Can you check with the customer to see what their status with this is now? If they still have the problem, have them: 1) send you the nhReset.sh they are using, 2) and the nhReset.out file from the command sh -x nhReset.sh -db (they need to copy all the output from this command and save it into nhReset.out), 3) also send the output of: env | grep NH_RESET Thanks Yulun 2/27/2001 4:56:39 PM yzhang The problem is that the customer's nhServer on NT was stopping during the weekend. I checked the script, and it looks to be running fine here. Can you check with the customer to see what their status with this is now? If they still have the problem, have them: 1) send you the nhReset.sh they are using, 2) and the nhReset.out file from the command sh -x nhReset.sh -db. They need to copy all the output from this command and save it into nhReset.out. 
3) also send the output of: env | grep NH_RESET Thanks Yulun 2/27/2001 4:57:51 PM yzhang Above description is a mistake; it is for another ticket. 2/27/2001 5:02:13 PM yzhang The problem is that the customer's nhServer on NT was stopping during the weekend. I checked the script, and it looks to be running fine here. Can you check with the customer to see what their status with this is now? If they still have the problem, have them: 1) send you the nhReset.sh they are using, 2) and the nhReset.out file from the command sh -x nhReset.sh -db. They need to copy all the output from this command and save it into nhReset.out. 3) also send the output of: env | grep NH_RESET 2/28/2001 2:52:09 PM yzhang Forward this nhResizeIngresLog script to the customer; have him replace the old one with this, then run the script. I have already talked to the customer about this. Thanks Yulun 3/1/2001 9:54:16 AM yzhang After the customer ran the correct nhResizeIngresLog.sh, they can do the polling now and the system started properly. They will let us know if the same problem occurs again. 3/1/2001 11:14:48 AM yzhang Can you have the customer send us their Maintenance.100004.log under $NH_HOME/log, if they have one. 3/12/2001 9:35:44 AM cestep -----Original Message----- From: Boake, John [mailto:John.Boake@jacobs.com] Sent: Monday, March 12, 2001 9:29 AM To: 'support' Subject: RE: Ticket #46156 - Nethealth stops due to Ingres stack dumps This problem seems to have been resolved. I have not had a stack dump outage in over 2 weeks now. Thanks for the help, Regards, 3/12/2001 9:40:36 AM yzhang problem solved, and ticket closed 12/27/2001 2:48:30 PM tstachowicz This customer cannot do database saves or have data analysis run because it errors with the following messages. Spoke with Robin, who told me to get the verifydb output and the size of the partition. The supporting information is on voyagerii/escalated tickets. 
Data_analysis: Error: Unable to execute 'DELETE FROM nh_daily_exceptions h WHERE EXISTS (select * from nh_elem_analyze a where a.element_id = h.element_id and h.sample_time >= a.sample_time) and sample_time >= (select min(sample_time) from nh_elem_analyze where sample_time > 0)' (E_QE007C Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) (Thu Feb 22 18:08:25 2001) Database Save: Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions () INTO '/opt/nethealth/idb/nh_backup/backup1.tdb/dye_b40'' (E_SC0206 An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Thu Feb 22 16:01:24 2001) ). (cdb/DuTable::saveTable) Errlog.log: GEMINI ::[32800 , 00000026]: Thu Feb 22 06:50:44 2001 E_DM9005_BAD_FILE_READ Disk file read error on database:nethealth table:nh_daily_exceptions pathname:/opt/nethealth/idb/ingres/data/default/nethealth filename:aaaaahgl.t00 page:64142 read() failed with operating system error 0 (Error 0) 3/2/2001 2:45:46 PM tstachowicz This is where nethealth is installed: /dev/dsk/c0t1d0s0 17413250 7683057 9556061 45% /opt/nethealth/idb looks OK for size. Output of verifydb is on escalated tickets/46000/46039 3/12/2001 1:50:58 PM yzhang Please have the customer collect the following information: 1) verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_daily_exceptions, and send me the iivdb.log (the verifydb output in the escalated directory does not have the information I want) 2) ls -l $II_SYSTEM/ingres/log > log.out, send me the log.out, 3) ls -l /opt/nethealth/idb/ingres/data/default/nethealth > nethealth.out, send me the nethealth.out 4) find out how long they maintain the DAC data Thanks Yulun 3/13/2001 1:34:49 PM tstachowicz -----Original Message----- From: Stachowicz, Tania Sent: Tuesday, March 
13, 2001 1:30 PM To: Zhang, Yulun Subject: ticket 46039, bug 12850 Yulun, Customer sent in the requested information (please see voyagerii/escalated tickets/46000/46039). I looked at the iivdb.log file and noticed that the output says that the nh_daily_exceptions table does not exist. Please verify that the following command you asked me to execute is correct: verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_daily_exceptions Also, the customer keeps the following baseline info for all DACs: Daily: 6 weeks Weekly: 13 weeks Monthly: 12 months Thanks, Tania 3/13/2001 3:03:27 PM yzhang The command is correct; they should be nhuser when running this verifydb command (they ran it as the ingres user, which is not correct). Have the customer resize the Ingres log to 2GB, and run the following query and send me the output: echo "select table_name from iifile_info where file_name = 'aaaaahgn'\g" | sql $NH_RDBMS_NAME > aaaaahgn.out. This table is about to hit 2GB. 3/13/2001 5:02:23 PM yzhang The major problem is that their system has trouble writing to the transaction log file, and the table nh_daily_exceptions is reaching 2GB. You need to have the customer resize the log to 2GB (as I mentioned in my last email) before they can do anything else. Also they need to check their hardware because there are I/O errors in the errlog.log. Thanks Yulun 3/14/2001 7:43:19 AM tstachowicz requested customer to increase transaction log. Customer has sent back the output of verifydb and the table name, forwarded to yulun. 3/14/2001 9:36:18 AM yzhang Tania, If a table is 2GB, all activity accessing that table will stop; sometimes 1.7 to 1.8GB also causes problems. There is one table there that is almost 1.7 GB. Currently their transaction log is 1GB, which is not big enough to handle a big table. Have them resize to 2GB first. Also I need the verifydb output for nh_daily_exceptions; run it as nhuser. Let me know if anything is not clear. 
Thanks Yulun 3/16/2001 6:43:26 AM tstachowicz -----Original Message----- From: Stachowicz, Tania Sent: Friday, March 16, 2001 6:38 AM To: Zhang, Yulun Subject: ticket 46039 bug 12850--nh_daily_exceptions error for database save and data analysis Yulun, Customer has increased the transaction log to 2G and there is no difference; the problem still persists. They will be testing their system shortly... what can we do in the meantime? Thanks, Tania 3/16/2001 1:44:02 PM yzhang After the customer says their disk is fine, and they have a good db backup, ask them to do the following: 1) nhDestroyDb $NH_RDBMS_NAME 2) nhCreateDb nethealth 3) load the backup database 4) echo "COPY TABLE nh_daily_exceptions () INTO 'nh_daily_exceptions.dat'\g" | sql nethealth 5) echo "DELETE FROM nh_daily_exceptions h WHERE EXISTS (select * from nh_elem_analyze a where a.element_id = h.element_id and h.sample_time >= a.sample_time) and sample_time >= (select min(sample_time) from nh_elem_analyze where sample_time > 0)\g" | sql nethealth > delete.out. Send me nh_daily_exceptions.dat and delete.out. If they succeed on steps 4 and 5, they should have no problem running the data analysis and dbsave. Thanks Yulun 3/19/2001 8:16:54 AM tstachowicz -----Original Message----- From: Stachowicz, Tania Sent: Monday, March 19, 2001 8:11 AM To: Zhang, Yulun Cc: Gray, Don Subject: ticket 46039 bug 12850--DbSave and DataAnalysis fail with error Yulun, Per your request to destroy, create and load the saved db... The customer's last good saved db was from a month ago and they are unwilling to load from this save. They will be doing the system check as requested. I will let you know the outcome of this. How can we move forward without loading from the last save? 
Thanks, Tania 3/20/2001 9:56:01 AM yzhang Tania, After they think the disk is OK, do the following: 1) stop then start Ingres using nhStopDb and nhStartDb, then run verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_daily_exceptions as nhuser, and get iivdb.log 2) echo "COPY TABLE nh_daily_exceptions () INTO 'nh_daily_exceptions.dat'\g" | sql nethealth 3) echo "DELETE FROM nh_daily_exceptions h WHERE EXISTS (select * from nh_elem_analyze a where a.element_id = h.element_id and h.sample_time >= a.sample_time) and sample_time >= (select min(sample_time) from nh_elem_analyze where sample_time > 0)\g" | sql nethealth > delete.out. 4) echo "help table nh_daily_exceptions\g" | sql nethealth > help.out 5) get the information regarding how long they retain stats data from the statistics rollup windows: as polled, 1 hour sample, 1 day sample. Send me all the output files, and have them keep nh_daily_exceptions.dat for backup. Let me know if anything is not clear. Thanks Yulun 3/23/2001 10:02:20 AM yzhang Tania, Looks like that table cannot be accessed normally. 1) they still need to make sure the hard disk is good, because there are many read errors there. 2) after that, run the attached script as nhuser just by typing the script name from the command prompt after sourcing the nethealthrc.csh file. The script will produce a binary file called nh_daily_exception.dat under the directory where the script is run. Send me this file. The script has been tested, and you don't have to do anything with it. Thanks Yulun 3/27/2001 10:27:12 AM tstachowicz still waiting on this information 3/28/2001 2:12:33 PM yzhang We have a customer who cannot save a table to file. We had them try copy table to file; copy table to a temp table, then copy the temp table to file; and use copydb -c dbname tablename. All of them failed. Following is some of the information from errlog.log. We also have the iivdb.log, and the output from copydb. These files can be sent if you want. 
They also checked that there is no disk or space problem. Can you tell us if there is any other way we can have the customer back up the table?

3/30/2001 10:37:59 AM yzhang
Farrell, Have the customer do the following:
1) make sure they are running the C shell
2) login as nhuser and source nethealthrc.csh
3) login as ingres and run: sysmod $NH_RDBMS_NAME
4) login as nhuser
5) run the attached script just by typing the script name
6) send us the nh_daily_exceptions_test.dat file produced in the directory the script was run in
7) copy the whole output from running the script into a file called script.out, and send us this file
8) don't do anything until I see the result

4/2/2001 10:04:37 AM yzhang
Farrell, I really don't want to disturb the table very much, because the customer is running several reports now. We will call the customer again Thursday at 9:00 AM this week about backing up the table. Thanks, Yulun

4/5/2001 10:34:26 AM yzhang
Based on the help.out the customer sent this morning, it looks like they have duplicates in two of the stats0 tables. Have the customer run the attached script just by typing the script name after logging in as nhuser and sourcing nethealthrc.csh, then run the stats rollup, followed by running the db save. If the db save fails, we will drop the table using verifydb drop table.

4/5/2001 1:08:04 PM yzhang
Run the attached script called prob_12850_create_table.sh (to create the nh_daily_exceptions table) just by typing the script name after logging in as nhuser and sourcing nethealthrc.csh. After running the script, check that the table was created, then run nhSaveDb, nhDestroyDb, nhCreateDb, and finally nhLoadDb. Farrell, can you help him with these? Make sure to keep your database backup in a safe place. Thanks

4/5/2001 2:01:09 PM foconnor
-----Original Message-----
From: Zhang, Yulun
Sent: Thursday, April 05, 2001 10:29 AM
To: O'Connor, Farrell
Subject: prob.
12850

Farrell, Based on the help.out the customer sent this morning, it looks like they have duplicates in two of the stats0 tables. Have the customer run the attached script just by typing the script name after logging in as nhuser and sourcing nethealthrc.csh, then run the stats rollup, followed by running the db save. If the db save fails, we will drop the table using verifydb drop table. Thanks, Yulun

-----Original Message-----
From: Zhang, Yulun
Sent: Thursday, April 05, 2001 1:44 PM
To: 'Andrew Pieterse'
Cc: O'Connor, Farrell
Subject: RE: prob. 12850

I modified the script; now you just run the script. The script will drop the nh_daily_exceptions_test table and create the nh_daily_exceptions table. Run the attached script called prob_12850_create_table.sh (to create the nh_daily_exceptions table) just by typing the script name after logging in as nhuser and sourcing nethealthrc.csh. After running the script, check that the table was created, then run nhSaveDb, nhDestroyDb, nhCreateDb, and finally nhLoadDb. Farrell, can you help him with these? Make sure to keep your database backup in a safe place. Thanks, Yulun

4/10/2001 11:25:20 AM rsanginario
-----Original Message-----
From: Zhang, Yulun
Sent: Thursday, April 05, 2001 1:44 PM
To: 'Andrew Pieterse'
Cc: O'Connor, Farrell
Subject: RE: prob. 12850

[Forwarded copy of the 1:44 PM message above.]

4/11/2001 7:46:08 AM tstachowicz
Everything is running fine. Closed call.
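For readers following the SQL in this thread, the step-5 DELETE from the 3/16 guidance prunes exception rows that the analyze table already covers. Below is a minimal sketch of the same predicate, using Python's sqlite3 as a stand-in for the Ingres sql utility (table names are kept from the ticket, the row data is invented, and sqlite does not accept the `h` alias in DELETE FROM, so the table name is written out):

```python
import sqlite3

# Illustrative only: sqlite3 stands in for the Ingres 'sql' utility, and the
# table contents are made up. The DELETE mirrors the one in the ticket:
# remove nh_daily_exceptions rows already covered by nh_elem_analyze.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nh_daily_exceptions (element_id INTEGER, sample_time INTEGER);
CREATE TABLE nh_elem_analyze     (element_id INTEGER, sample_time INTEGER);
INSERT INTO nh_daily_exceptions VALUES (1, 100), (1, 300), (2, 300);
INSERT INTO nh_elem_analyze     VALUES (1, 200), (2, 400);
""")

con.execute("""
DELETE FROM nh_daily_exceptions
WHERE EXISTS (SELECT * FROM nh_elem_analyze a
              WHERE a.element_id = nh_daily_exceptions.element_id
                AND nh_daily_exceptions.sample_time >= a.sample_time)
  AND sample_time >= (SELECT MIN(sample_time) FROM nh_elem_analyze
                      WHERE sample_time > 0)
""")

remaining = con.execute(
    "SELECT element_id, sample_time FROM nh_daily_exceptions ORDER BY 1, 2"
).fetchall()
print(remaining)   # -> [(1, 100), (2, 300)]
```

Only rows at or after the earliest positive analyze time that an analyze entry already covers are removed; everything else survives.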
4/11/2001 8:08:39 AM yzhang
Can you check with the customer? We should close this ticket if everything is running OK.

5/10/2001 10:35:24 AM yzhang
Problem solved.

2/28/2001 11:08:45 AM dkrauss
Customer had done a database save from the command line, which ultimately had errors. Output on the command line states 'See log file in for details...' Customer would like at least a notification at the command line if there were errors in the save, if not the error itself. Customer goes through ICS Resellers. Customer contact: Mrs. Regine Umlauft, email: R.Umlauft@alcatel.de

9/1/2001 3:19:12 AM AR_ESCALATOR
Administrative change. This ticket has been created as an Enhancement Request.

2/28/2001 3:35:35 PM mpoller
This is submitted on behalf of Michael Steele of Getronics. "For statistics database rollups, I feel that we lose a lot of important data when we move from, say, 'As Polled' to '1 Hour Samples.' I think that it would help us incredibly if we had the option to roll up to '1 hour samples with peaks.' If we could keep the maximum and minimum with the averages, our spikes would not disappear." Michael Steele, Systems Analyst, Getronics Government Solutions, CSOC/NASA

9/1/2001 3:19:12 AM AR_ESCALATOR
Administrative change. This ticket has been created as an Enhancement Request.

3/12/2001 7:09:29 PM jdodge
When a checkpoint save is run, it overwrites the last save with the new save; therefore, if there was a problem with the night before's save, there is no way to get that data back since it has been overwritten. Customer would like to be able to specify the number of days a checkpointed database save is kept.

9/1/2001 3:19:12 AM AR_ESCALATOR
Administrative change. This ticket has been created as an Enhancement Request.
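The "peaks" enhancement request above asks the rollup to carry min and max alongside the hourly average, so spikes survive the As Polled to 1 Hour Samples transition. A hypothetical sketch of what such a rollup could compute (the function name, bucket size, and sample data are invented for illustration; this is not the product's rollup code):

```python
from collections import defaultdict

# Hypothetical sketch of a "1 hour samples with peaks" rollup: retain min
# and max per hourly bucket in addition to the average, so short spikes
# are not averaged away. All names and data here are invented.
def rollup_with_peaks(samples, bucket_secs=3600):
    """samples: iterable of (epoch_time, value) as-polled points."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[t - t % bucket_secs].append(v)
    return {
        start: {
            "avg": sum(vals) / len(vals),
            "min": min(vals),   # the peak information averages alone lose
            "max": max(vals),
        }
        for start, vals in buckets.items()
    }

polled = [(0, 10.0), (300, 12.0), (600, 95.0), (3600, 11.0)]
hourly = rollup_with_peaks(polled)
print(hourly[0]["max"])   # -> 95.0 : the spike in hour 0 is preserved
```

With only an average kept, hour 0 would report 39.0 and the 95.0 spike would disappear, which is exactly the complaint in the request.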
3/15/2001 7:21:41 AM tstachowicz
All have been failing since around March 1st:

*******************************
ERRLOG.LOG:
00000201 Thu Mar 01 17:55:46 2001 E_DM93A7_BAD_FILE_PAGE_ADDR Page 9054 in table nh_daily_health, owner: nethealth, database: nethealth, has an incorrect page number: -1. Other page fields: page_stat 0000FFFF, page_log_address (FFFFFFFF,FFFFFFFF), page_tran_id (FFFFFFFFFFFFFFFF). Corrupted page cannot be read into the server cache.
00000201 Thu Mar 01 17:55:46 2001 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location.
00000201 Thu Mar 01 17:55:46 2001 E_DM93A7_BAD_FILE_PAGE_ADDR Page 9055 in table nh_daily_health, owner: nethealth, database: nethealth, has an incorrect page number: -1. Other page fields: page_stat 0000FFFF, page_log_address (FFFFFFFF,FFFFFFFF), page_tran_id (FFFFFFFFFFFFFFFF). Corrupted page cannot be read into the server cache.
00000201 Thu Mar 01 17:55:46 2001 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location.
00000201 Thu Mar 01 17:55:46 2001 E_DM93A7_BAD_FILE_PAGE_ADDR Page 9055 in table nh_daily_health, owner: nethealth, database: nethealth, has an incorrect page number: -1. Other page fields: page_stat 0000FFFF, page_log_address (FFFFFFFF,FFFFFFFF), page_tran_id (FFFFFFFFFFFFFFFF). Corrupted page cannot be read into the server cache.
00000201 Thu Mar 01 17:55:46 2001 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page.
00000201 Thu Mar 01 17:55:46 2001 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager.
00000201 Thu Mar 01 17:55:46 2001 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page.
00000201 Thu Mar 01 17:55:46 2001 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location.
00000201 Thu Mar 01 17:55:46 2001 E_DM9261_DM1B_GET Error occurred getting a record.
00000201 Thu Mar 01 17:55:46 2001 E_DM904C_ERROR_GETTING_RECORD Error getting a record from database:nethealth, owner:nethealth, table:nh_daily_health.
00000201 Thu Mar 01 17:55:46 2001 E_DM008A_ERROR_GETTING_RECORD Error trying to get a record.

*********************************
DATABASE SAVE:
Begin processing (1/3/2001 17:00:03). Copying relevant files (1/3/2001 17:00:05). Unloading the data into the files, in directory: 'D:/nethealth/db/save.tdb/'. . . Unloading table nh_daily_exceptions . . . Unloading table nh_daily_health . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_health () INTO 'D:/nethealth/db/save.tdb/dyh_b40'' (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. ). (cdb/DuTable::saveTable)

************************************
DATA ANALYSIS:
Begin processing (2/3/2001 00:15:09). Error: Sql Error occurred during operation (E_QE007C Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) (Fri Mar 02 00:24:44 2001) ). Error: Sql Error occurred during operation (E_QE007C Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) (Fri Mar 02 00:24:44 2001) ). Error: Unable to execute 'MODIFY nh_daily_health TO MERGE' (E_QE0083 Error modifying a table. (Fri Mar 02 00:24:57 2001) ).

************************************
1. I ran a VerifyDb and included the output (complaining about an incorrect page number of -1) on voyagerii/escalated tickets/46000/46495
2. I also got the size of the nh_daily_health file => aaaabdd.t00 => 37822464, which is not large.
3. The output of this command is on voyagerii: echo "select create_date, num_pages, overflow_pages from iitables where table_name = 'nh_daily_health'\g" | sql $NH_RDBMS_NAME > daily_health.out => There is no num_pages column.
4. The output of this command is on voyagerii: echo "help table nh_daily_health \g" | sql $NH_RDBMS_NAME > help.out
5.
Had them try: echo " copy nh_daily_health() into 'nh_daily_health.dat' \g" | sql $NH_RDBMS_NAME; if this succeeds, send us the nh_daily_health.dat. => output on voyagerii. Yulun and Robin are both aware of this issue.

3/15/2001 1:46:12 PM tstachowicz
-----Original Message-----
From: Stachowicz, Tania
Sent: Thursday, March 15, 2001 1:41 PM
To: Zhang, Yulun
Subject: ticket 13102 -- Database save, Data analysis and reports failing with nh_daily_health table and page -1 error

Yulun, Any news on this one? Thanks, Tania

3/15/2001 3:42:34 PM yzhang
It looks like they have a hardware problem. The following is a small portion of the errlog.log file, which indicates that there are incorrect page numbers and corrupted pages. Have the customer do a disk check and make sure their disk is OK. Meanwhile, I have created an issue with CA regarding whether there is any tuning I can do for this problem. Thanks, Yulun

3/19/2001 8:13:17 AM tstachowicz
-----Original Message-----
From: Stachowicz, Tania
Sent: Monday, March 19, 2001 8:08 AM
To: Zhang, Yulun
Subject: ticket 46495 bug 13102 -- Database save, Data analysis and reports failing with nh_daily_health table and page -1 error
Importance: High

Yulun, The customer has written me and sent the following information:
1. the version.dat and version.rel files
2. They ran scandisk on all drives and there were no errors reported or found.
3. Strangely enough, the reports for the groups ran through successfully last night, although nothing has changed. Any idea???
4. Sent me the latest log files
Please let me know the next move. Tania

3/19/2001 11:31:52 AM yzhang
Have the customer try the following:
1) stop then start Ingres from Services in the Control Panel; this will clear the cache.
2) sql $NH_RDBMS_NAME
3) help table nh_daily_health \g
Check the output to see if there is a primary index on this table. If there is, they should see something like this:

Column Information:
                                       Key
Column Name      Type     Length Nulls Defaults Seq
config_id        integer       4    no       no   3
element_id       integer       4    no       no   2
aggr_type        integer       2    no       no   4
variant_type     integer       2    no       no   5
sample_time      integer       4    no       no   1
granularity_lvl  integer       2    no       no

If there is an existing primary index, do: MODIFY nh_daily_health TO MERGE. This should succeed. After all of the above, they should be able to do the db save and the data analysis. Thanks, Yulun

3/22/2001 8:52:40 AM tstachowicz
Customer ran the sql commands. Tried the db save again:
$NH_HOME/bin/sys/nhiSaveDb -u $NH_USER -d $NH_RDBMS_NAME -p D:/nethealth/db/save.tdb
-----
Begin processing (21/3/2001 17:00:56). Copying relevant files (21/3/2001 17:00:57). Unloading the data into the files, in directory: 'D:/nethealth/db/save.tdb/'. . . Unloading table nh_daily_exceptions . . . Unloading table nh_daily_health . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_health () INTO 'D:/nethealth/db/save.tdb/dyh_b40'' (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem. ). (cdb/DuTable::saveTable)
-----
Scheduled Job ended at '21/3/2001 17:01:17'.

3/22/2001 11:35:57 AM yzhang
Tania, I have just called the reseller. He told me they have no problem with their disk and space. You mentioned that they can do help table nh_daily_health and modify nh_daily_health to merge, but I don't know if the table has the primary index, or the number of rows in the table. As I am looking at the nh_daily_health.dat you sent before, it looks like the nh_daily_health table has a lot of duplicates, which might come from the nh_stats tables.
Collect the following info for me:
1) echo "help \g" | sql nethealth > help.out
2) echo "help table nh_daily_health \g" | sql nethealth > nh_daily_health.out
3) verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_daily_health, run this as nhuser, and get iivdb.log
4) echo "copy nh_daily_health() into 'nh_daily_health.dat' \g" | sql nethealth (I have the old one, but I want a fresh one for this)
Make sure to get all four files. It should only take the customer about 15 minutes to collect all of the data. Thanks, Yulun

3/26/2001 11:36:50 AM yzhang
One of the stats0 tables has the duplicates. Please have them run the attached script just by typing the script name after sourcing nethealthrc.csh.

3/27/2001 9:36:36 AM tstachowicz
Save still failed after running the cleanStats script.

3/27/2001 10:25:31 AM tstachowicz
Changing the status to assigned.

3/27/2001 6:42:16 PM yzhang
Tania, Forget the email I sent you this afternoon; use this email as guidance for prob. 13102. First have them do:
sql $NH_RDBMS_NAME
create table nh_daily_health_bak as select * from nh_daily_health\g
If this succeeds: copy nh_daily_health_bak() into 'nh_daily_health_bak.dat'\g
If this succeeds: drop table nh_daily_health_bak\g
The above is just to make sure they have a backup before running the script, even though the script does the backup through the temp table. If this succeeds, run the attached script just by typing the script name, copy the output of the script into a file called stdout.out, then send me the following. The script has been tested.
1) nh_daily_health_bak.dat
2) copy.in
3) copy.out
4) stdout.out

3/29/2001 11:13:10 AM foconnor
Files are in //voyagerii/tickets/46000/46495/3-28-01

3/29/2001 1:14:06 PM foconnor
Requested this file from the customer, as per Yulun: C:/nethealth/nh_daily.net

3/29/2001 2:03:00 PM yzhang
We are currently working with CA. If it still doesn't work, we will have the customer drop the table, and the customer will lose data.

3/30/2001 9:42:42 AM yzhang
Good, this one can be de-escalated.
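The 3/27 guidance above follows a back-up-then-modify pattern: copy the suspect table aside, verify the copy, and only then touch the original. A sketch of that pattern with sqlite3 standing in for Ingres (the ticket's actual cleanup script is not shown anywhere, so SELECT DISTINCT into a rebuilt table is used here as one generic dedup step; the data is invented):

```python
import sqlite3

# Sketch of the backup-before-modify pattern from the 3/27 guidance:
# copy the suspect table aside (CREATE TABLE ... AS SELECT), confirm the
# copy holds the same number of rows, and only then touch the original.
# sqlite3 stands in for the Ingres sql utility; the rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nh_daily_health (element_id INTEGER, sample_time INTEGER);
INSERT INTO nh_daily_health VALUES (1, 100), (1, 100), (2, 200);
""")

con.execute("CREATE TABLE nh_daily_health_bak AS SELECT * FROM nh_daily_health")
orig = con.execute("SELECT COUNT(*) FROM nh_daily_health").fetchone()[0]
bak  = con.execute("SELECT COUNT(*) FROM nh_daily_health_bak").fetchone()[0]
assert bak == orig   # only proceed once the backup is verified

# With the backup in place, deduplicate the original. The attached script's
# real logic is not in the ticket; a DISTINCT copy is one generic approach.
con.executescript("""
CREATE TABLE tmp AS SELECT DISTINCT * FROM nh_daily_health;
DROP TABLE nh_daily_health;
ALTER TABLE tmp RENAME TO nh_daily_health;
""")
deduped = con.execute("SELECT COUNT(*) FROM nh_daily_health").fetchone()[0]
print(deduped)   # -> 2 : the duplicate (1, 100) row was collapsed
```

The point of the pattern is that the destructive step never runs until a verified copy exists, which is why the ticket insists on keeping the database backup in a safe place.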
4/2/2001 12:00:53 PM rkeville
De-escalated per Farrell; appears to be fixed. Awaiting verification from the customer.

4/6/2001 8:43:52 AM foconnor
Customer says that this issue can be closed.

3/19/2001 10:31:00 AM rkeville
Customer is Computer Associates; they get Dr. Watson errors on dialog rollups after upgrading to 4.8. I resolved this before on 4.6 P05 with an .exe with a quadrupled stack size.
-----Original Message-----
From: Keville, Bob
Sent: Monday, March 19, 2001 10:23 AM
To: Andrews, Dave; Zhang, Yulun
Cc: Gray, Don; Trei, Robin
Subject: Urgent: Need a nhiDialogRollup for 4.8 that has additional stack size.
Importance: High

Hi all, Computer Associates is again threatening to drop us due to TA issues. They are again getting Dr. Watson errors on dialog rollups. I corrected this 2 weeks ago in 4.6 P05 by running an editbin command on nhiDialogRollup.exe and sending it to them. I will need this done again to resolve this issue. Dave, can you provide access to this exe to Yulun, please? Yulun, would you run the editbin on the exe and supply this to support for call ticket 45046?
editbin /stack:8388608 nhiDialogRollup.exe
Thanks, -Bob
#############################################################

3/19/2001 10:32:45 AM rkeville
-----Original Message-----
From: Trei, Robin
Sent: Monday, March 19, 2001 10:27 AM
To: Keville, Bob; Andrews, Dave; Zhang, Yulun
Cc: Gray, Don
Subject: RE: Urgent: Need a nhiDialogRollup for 4.8 that has additional stack size.

And just to let everyone know, Jay has done some work that should help this problem... I am tracking down what version it is going in, etc., so we are working on a permanent fix.

3/20/2001 2:14:49 PM yzhang
Waiting for the customer to install the new kit.

3/21/2001 4:05:33 PM yzhang
This looks like a console GUI performance problem; it took the customer several minutes for the Setup menu to come up from the console. I suggest you assign this to the console people.
Thanks, Yulun

3/22/2001 12:33:14 PM yzhang
Walter, Let's get an update from the customer on this.
1) check that they have installed nh48 and patch 2 successfully
2) find out if they still have the problem opening the Setup menu from the console, and how long it takes
3) find out their current transaction log size
4) run nhDbStatus, and get the output
5) as far as I know, they have set nh_unref_node_limit and the other environment variable to get rid of extra nodes. If they still have over 700,000 nodes, we need to run the cleanNodes script, then test the conversation rollup.
6) echo "select node_elem_id, count (*) from nh_address group by node_elem_id having count (*) > 1
Thanks, Yulun

3/22/2001 4:12:18 PM yzhang
The file you attached is empty. The request you sent to the customer in step 3 should be:
3. echo "select node_elem_id, count (*) from nh_address group by node_elem_id having count (*) > 1\g" | sql nethealth > duplicate_node_elem_id.out
I think I did not make this clear in my last email. Also get the output of nhDbStatus. Thanks

3/26/2001 6:12:27 PM yzhang
Talked to the customer about their problems, current status, and some parameters: the number of nodes is about 50,000, the transaction log size is 2 GB, and there are no duplicate node_elem_ids in the nh_address table. Their installation of nh48 and P2 was OK. They can now open the Setup menu from the console immediately, they can run a report in a few minutes, and the poller is running. We just want to see if everything will continue working fine. Yulun

3/27/2001 12:15:57 PM yzhang
Walter, I talked with Robin. She suggested that we need to see if the table was indexed properly. Can you get the following and send me the two files? Thanks
echo "select table_name, number_rows, number_pages, overflow_pages from iitables \g" | sql $NH_RDBMS_NAME > iitable.out
echo " help \g" | sql $NH_RDBMS_NAME > help.out

3/27/2001 4:12:24 PM wburke
Obtained.

3/28/2001 3:58:08 PM yzhang
From the file attached, it looks like the index and storage structure are fine.
But there is no data in any of the split DAC tables, which makes me wonder if their conversion succeeded. Walter, can you get the following:
1) install.log for the nh48 installation
2) the nh48 patch install.log
4) echo "select * from nh_schema_version\g" | sql nethealth > version.out
5) echo "select count (*) from nh_node_addr_pair\g" | sql nethealth > node_addr_pair.out
6) put their nh46 backup (the database before the upgrade to nh48) onto our ~ftp/incoming site
Try to get this information before tomorrow's meeting. Thanks, Yulun

3/29/2001 1:37:13 PM wburke
Obtained.

3/30/2001 1:20:13 PM yzhang
The problem was evaluated yesterday in a meeting.

3/30/2001 1:22:17 PM yzhang
The problem was evaluated yesterday in a meeting. Either Ingres or TA data might be causing the problem. Have asked support to collect the information.

3/30/2001 2:52:42 PM yzhang
Changed to moreinfo.

3/30/2001 3:03:14 PM yzhang
Thanks for the information you obtained from CA. From the errlog.log, I noticed that there are some operating system errors, and also errors deleting files. Can you have them run a disk check, to make sure the disk and disk drives are OK? Thanks, Yulun

4/2/2001 4:22:56 PM yzhang
When I was talking to CA, they told me that you have asked them to destroy and reload the db. Looks like they might be back to work now. Thanks, Yulun

4/2/2001 6:02:54 PM wburke
No such luck. Same issues.

4/4/2001 12:21:43 PM yzhang
Waiting for more information.

4/4/2001 3:10:39 PM wburke
Dr. Watson failures are gone. Close bug ticket.

4/4/2001 5:01:37 PM yzhang
Dr. Watson failures are gone. Close bug ticket.

3/21/2001 4:38:08 PM jpoblete
Customer: Sprint. Statistics Rollups failing with the following error:
Begin processing (2001/03/20 20:00:40). Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Mar 20 20:09:34 2001) ).
We tried the cleanStats.sh script several times with no luck. The output of the script shows the following:
nh_stats0_983818799 nh_stats0_983869199 nh_stats0_983872799 nh_stats0_983876399 nh_stats0_983879999 nh_stats0_983883599 nh_stats0_983887199 nh_stats0_983890799 nh_stats0_983894399 nh_stats0_983897999 nh_stats0_983901599 nh_stats0_983908799 nh_stats0_983912399 nh_stats0_983915999 nh_stats0_983919599 nh_stats0_983923199 nh_stats0_983926799 nh_stats0_983930399 nh_stats0_983933999 nh_stats0_984477599 nh_stats0_984481199 nh_stats0_984484799 nh_stats0_984488399 nh_stats0_984491999
Collected the nhiIndexDiag output, the Statistics Rollup log, all log files in $II_SYSTEM/ingres/files, the echo "help\g" | sql nethealth output, and the system log; there were no files in the $NH_HOME/tmp directory. Additional data is available from the customer upon request. The collected data is available on: \\voyagerii\Escalated Tickets\46827\Data3-21

3/22/2001 2:41:45 PM wzingher
Reassigning to Robin.

3/29/2001 1:15:44 PM apier
We need engineering assistance. The cleanStats script does not resolve this problem. Tony

3/29/2001 6:15:57 PM yzhang
Make a new directory under $NH_HOME called test_dup, copy the attached script nhCleanDupStats.sh there, change the mode to 777, and run the script just by typing the script name after logging in as nhuser with nethealthrc.csh sourced. Keep the files created under test_dup. After running the script, send me the following file: echo "help \g" | sql nethealth > help.out. Thanks, Yulun

3/31/2001 11:41:38 AM yzhang
Jose, I checked the help.out and the tar file you placed in the escalation directory. It looks like the duplicates have been removed. You can have the customer do the stats rollup now, and let me know the result. Thanks, Yulun

4/2/2001 4:40:31 PM yzhang
The stats rollup succeeded.

3/23/2001 9:05:26 AM cestep
The NetHealth console dbstatus shows that Conversations Rollups have been failing, but the logs don't reflect this.
Had the customer try nhiDialogRollup from the command line and received a segmentation fault error, which produced a core dump. This customer has recently upgraded from NetHealth 4.5.1 to 4.7.1. Conversations Rollups have been failing since they were running NetHealth 4.5.1. Customer's core dump is on voyagerii.

3/26/2001 2:21:38 PM cestep
Solved this problem by unlimiting the stack size. Moving this to "No bug" and closing the call ticket.

3/23/2001 10:01:12 AM rrick
Hi Yulun, Yesterday you and Sheldon asked me to have this customer, Alex, perform the following: setenv II_EMBED_SET, then rerun the install of version 4.6. It turns out that his 4.5.1 system was in tough shape and he ran out of disk space. So we cleared the disk and installed 4.6 from scratch, then brought up 4.6. This went well. Then we loaded the 4.5.1 database via nhLoadDb; it ran for a few hours and the customer ended up with the error in the following file. NOTE: Please bring up in vi! Can you please help? There is definitely something wrong with their database save. Thanks again, Russell K. Rick, CNE, CUA
Error: Loading table nh_var_units . . . Loading table nh_hourly_health . . . Loading the sample data . . . Updating a prior version 4.5 database . . . Fatal database error: Step 3 in rev 22 22-Mar-2001 20:30:30 - Database error: -33000, E_CO003B COPY: Error writing to Copy File while processing row 182407. Load of database 'nethealth' for user 'neth' was unsuccessful. Error: The program nhiLoadDb failed.

3/23/2001 11:34:10 AM rrick
NILM with Yulun to call me back.

3/23/2001 1:21:11 PM yzhang
You did not specify the value of the environment variable II_EMBED_SET. It should be like this: setenv II_EMBED_SET printqry. After setting this, run the install; when the install hits the fatal error, check the iiprtqry.log file under the directory where you ran the installation.
Send iiprtqry.log to me. Thanks, Yulun

3/27/2001 2:02:07 PM rrick
Hi Don, I just spoke with the customer and he was not able to produce the file that Yulun wanted. I am going to write up the procedure for this customer and have him do this again. We had spoken over the phone last week when John Witty was on-site, and the customer says he may have misinterpreted my instructions over the phone. Hope this helps, Russell K. Rick, CNE, CUA

3/27/2001 2:21:12 PM rrick
Hi Alex, Please perform the following:
1. Change the following NetHealth environment variable in $NH_HOME/nethealthrc.csh, $NH_HOME/nethealthrc.sh, and $NH_HOME/nethealthrc.ksh (the .ksh file only if you have one and are running Network Health in the Korn shell):
setenv II_EMBED_SET printqry
NOTE: It is not recommended to change the actual NetHealth environment variables in the nethealthrc.csh, nethealthrc.sh, and nethealthrc.ksh files. What is recommended is to copy the lines from these files, paste them into $NH_HOME/nethealthrc.csh.usr, $NH_HOME/nethealthrc.sh.usr, and $NH_HOME/nethealthrc.ksh.usr respectively, and change the values there. The .usr files will override the environment variable values in the base nethealthrc files.
2. Source the $NH_HOME/nethealthrc.csh, $NH_HOME/nethealthrc.sh, or $NH_HOME/nethealthrc.ksh file, as appropriate.
3. Re-run nhLoadDb.
4. When nhLoadDb hits the fatal error, check the iiprtqry.log file under the directory where you ran nhLoadDb.
5. Send iiprtqry.log to support@concord.com, attn: Russ Rick.
Thanks again for all your patience, Russell K.
Rick, CNE, CUA

3/30/2001 2:30:11 PM yzhang
Russell, I was trying to contact Alex Berry (the customer); his message indicated he will be out next week. So can you check with him about the following today:
1) Did he run nhLoadDb with II_EMBED_SET set? If he did, send us the iiprtqry.log file.
2) If he did not run it, check the following:
a) get the size and permissions for $TMPDIR: ls -ld $TMPDIR > tmpdir.out; df -k . > space.out
b) find out if they are doing a command-line db load or a console db load
c) cd to the db backup directory, unzip smt_b23.zip to smt_b23, then do: echo "copy table nh_element() from 'smt_b23' \g" | sql $NH_RDBMS_NAME
3) Get me the errlog.log and smt_b23.
Let me know how each step goes. Thanks, Yulun

4/2/2001 10:44:23 AM rkeville
De-escalated as the customer is not responding.

4/2/2001 11:53:28 AM rrick
-----Original Message-----
From: Zhang, Yulun
Sent: Friday, March 30, 2001 2:25 PM
To: Rick, Russell
Cc: 'abberry@emory.edu'
Subject: problem 13224 / call ticket 47150

Russell, I was trying to contact Alex Berry (the customer); his message indicated he will be out next week. So can you check with him about the following today:
1) Did he run nhLoadDb with II_EMBED_SET set? If he did, send us the iiprtqry.log file.
2) If he did not run it, check the following:
a) get the size and permissions for $TMPDIR: ls -ld $TMPDIR > tmpdir.out; df -k . > space.out
b) find out if they are doing a command-line db load or a console db load
c) cd to the db backup directory, unzip smt_b23.zip to smt_b23, then do: echo "copy table nh_element() from 'smt_b23' \g" | sql $NH_RDBMS_NAME
3) Get me the errlog.log and smt_b23.
Let me know how each step goes.
Thanks

-----Original Message-----
From: Rick, Russell
Sent: Monday, April 02, 2001 11:46 AM
To: Zhang, Yulun
Subject: RE: problem 13224 / call ticket 47150

Yulun, Since Alex has not gotten back to me, did not send in the log file, and is on vacation this week, I have de-escalated this ticket until he gets back. Hope this helps, Russell K. Rick, CNE, CUA

4/2/2001 11:53:39 AM rrick
.

4/20/2001 11:22:31 AM yzhang
Can you check with the customer to see if they have done the following:
1) Did he run nhLoadDb with II_EMBED_SET set? If he did, send us the iiprtqry.log file.
2) If he did not run it, check the following:
a) get the size and permissions for $TMPDIR: ls -ld $TMPDIR > tmpdir.out; df -k . > space.out
b) find out if they are doing a command-line db load or a console db load
c) cd to the db backup directory, unzip smt_b23.zip to smt_b23, then do: echo "copy table nh_element() from 'smt_b23' \g" | sql $NH_RDBMS_NAME
3) Get me the errlog.log and smt_b23.
Let me know how each step goes. Thanks

5/16/2001 10:40:31 AM yzhang
This one has been open for a long time; check with the customer if they still have the problem, or close it.

5/16/2001 12:43:51 PM rrick
Customer never responded. Closing ticket.

3/26/2001 9:37:32 AM foconnor
Customer is getting checksum errors again. Output of nhCollectCustData can be found at: //voyagerii/tickets/43000/43090/March26
keg% whattime 981575999 (offending table)
Wed Feb 7 14:59:59 2001
slsanh ::[ingres , 00000440]: Sat Feb 10 19:16:42 2001 E_CL2530_CS_PARAM default_page_size = 2048
slsanh ::[ingres , 00000440]: Sat Feb 10 19:16:42 2001 E_CL2530_CS_PARAM sec_label_cache = 100
SLSANH ::[49168 , 400e0a30]: Sat Feb 10 19:16:43 2001 E_SC0129_SERVER_UP Ingres Release OI 2.0/9712 (hp8.us5/00) Server -- Normal Startup.
SLSANH ::[49168 , 40a2f520]: Mon Feb 19 05:35:27 2001 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats0_981575999, Page 274.
SLSANH ::[49168 , 40a22120]: Tue Feb 20 03:11:02 2001 E_DM910A_DM1U_VERIFY_INFO Information: A user is running either the report, repair or patch table operation on table nh_stats0_981575999, owner: lehberg, database: nethealth, operation type: 000000F1.
SLSANH ::[49168 , 409b8040]: Thu Feb 22 20:06:06 2001 E_SC0216_QEF_ERROR Error returned by QEF.
SLSANH ::[49168 , 409b8040]: Thu Feb 22 20:06:06 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
SLSANH ::[49168 , 409b8040]: Thu Feb 22 20:06:06 2001 E_SC0216_QEF_ERROR Error returned by QEF.
SLSANH ::[49168 , 409b8040]: Thu Feb 22 20:06:06 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
SLSANH ::[49168 , 409b8040]: Thu Mar 1 21:04:11 2001 E_SC0216_QEF_ERROR Error returned by QEF.
SLSANH ::[49168 , 409b8040]: Thu Mar 1 21:04:11 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
SLSANH ::[49168 , 409b8040]: Thu Mar 1 21:04:11 2001 E_SC0216_QEF_ERROR Error returned by QEF.
SLSANH ::[49168 , 409b8040]: Thu Mar 1 21:04:11 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log
SLSANH ::[49168 , 409ced20]: Mon Mar 19 19:12:19 2001 E_SC0216_QEF_ERROR Error returned by QEF.
SLSANH ::[49168 , 409ced20]: Mon Mar 19 19:12:19 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query.
Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log SLSANH ::[49168 , 409ced20]: Mon Mar 19 19:12:19 2001 E_SC0216_QEF_ERROR Error returned by QEF. SLSANH ::[49168 , 409ced20]: Mon Mar 19 19:12:19 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log SLSANH ::[49168 , 40a31060]: Wed Mar 21 07:41:00 2001 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats0_984527999 Server information: OS and patch level, especially any information about the disk drives. HP 9000 N 4000-36 running HP-UX 11.00 Is this 64 bit? The disk is mirrored with HP-Mirrordisk Utility. Product Id: ST39103LC Vendor: SEAGATE Device Type: SCSI Disk Firmware Rev: HP01 Device Qualifier: SEAGATEST39103LC Logical Unit: 0 Serial Number: LSA88333000010283TBZ Capacity (M Byte): 8683.16 Block Size: 512 Max Block Address: 17783111 Error Logs Read Errors: 0 Buffer Overruns: N/A Read Reverse Errors: N/A Buffer Underruns: N/A Write Errors: 0 Non-Medium Errors: 0 Verify Errors: 0 Attached information: - Volume groups --- VG Name /dev/vg00 VG Write Access read/write VG Status available Max LV 255 Cur LV 10 Open LV 10 Max PV 16 Cur PV 2 Act PV 2 Max PE per PV 2500 VGDA 4 PE Size (Mbytes) 4 Total PE 4338 Alloc PE 4016 Free PE 322 Total PVG 0 Total Spare PVs 0 Total Spare PVs in use 0 --- Logical volumes --- LV Name /dev/vg00/lvol1 LV Status available/syncd LV Size (Mbytes) 84 Current LE 21 Allocated PE 42 Used PV 2 LV Name /dev/vg00/lvol2 LV Status available/syncd LV Size (Mbytes) 1024 Current LE 256 Allocated PE 512 Used PV 2 LV Name /dev/vg00/lvol3 LV Status available/syncd LV Size (Mbytes) 140 Current LE 35 Allocated PE 70 Used PV 2 LV Name /dev/vg00/lvol4 LV Status available/syncd LV Size (Mbytes) 100 Current LE 25 
Allocated PE 50 Used PV 2 LV Name /dev/vg00/lvol5 LV Status available/syncd LV Size (Mbytes) 20 Current LE 5 Allocated PE 10 Used PV 2 LV Name /dev/vg00/lvol6 LV Status available/syncd LV Size (Mbytes) 2000 Current LE 500 Allocated PE 1000 Used PV 2 LV Name /dev/vg00/lvol7 LV Status available/syncd LV Size (Mbytes) 1064 Current LE 266 Allocated PE 532 Used PV 2 LV Name /dev/vg00/lvol8 LV Status available/syncd LV Size (Mbytes) 2000 Current LE 500 Allocated PE 1000 Used PV 2 LV Name /dev/vg00/lvol9 LV Status available/syncd LV Size (Mbytes) 600 Current LE 150 Allocated PE 300 Used PV 2 LV Name /dev/vg00/lvol10 LV Status available/syncd LV Size (Mbytes) 1000 Current LE 250 Allocated PE 500 Used PV 2 --- Physical volumes --- PV Name /dev/dsk/c1t6d0 PV Status available Total PE 2169 Free PE 161 Autoswitch On PV Name /dev/dsk/c4t6d0 PV Status available Total PE 2169 Free PE 161 Autoswitch On VG Name /dev/vg01 VG Write Access read/write VG Status available Max LV 255 Cur LV 1 Open LV 1 Max PV 16 Cur PV 2 Act PV 2 Max PE per PV 2170 VGDA 4 PE Size (Mbytes) 4 Total PE 4340 Alloc PE 1500 Free PE 2840 Total PVG 0 Total Spare PVs 0 Total Spare PVs in use 0 --- Logical volumes --- LV Name /dev/vg01/lvol1 LV Status available/syncd LV Size (Mbytes) 3000 Current LE 750 Allocated PE 1500 Used PV 2 --- Physical volumes --- PV Name /dev/dsk/c2t6d0 PV Status available Total PE 2170 Free PE 1420 Autoswitch On PV Name /dev/dsk/c4t5d0 PV Status available Total PE 2170 Free PE 1420 Autoswitch On 3/30/2001 10:36:23 AM lemmon Reassigned to Yulun Zhang 3/30/2001 11:18:53 AM yzhang The following is the email Farrell sent to the reseller regarding our view on the checksum error, and we are waiting for their response.
We have reviewed this issue with our database team and here are some of the recommendations and comments that came out of that meeting: 1) We have logged a bug with CA previously, and the only thing we are getting from CA is that checksum errors are a result of a miscommunication when reading and writing files to and/or from the disk. In their words, "The error message is generated whenever a query tries to access the table." We can log a bug again, but our expectation is that we will not get anything new. 2) Is Alcanet willing to change their hardware? Can they change from what they have now to a directly attached SCSI device for their database? Our experience has shown that one of the reasons for checksum errors is either disk problems (which does not appear to be the case here) or hardware incompatibility problems between Ingres and the hardware. Can they run a test system with a different hard drive with the same database? 3) Upgrading to Network Health 4.8. We have no expectation that this will change the situation; aside from CA addressing some problems, the newer Ingres that installs with 4.8 is nearly the same. Can they at least load their database on a test system with 4.8? 4) Can Alcanet live with this issue? We have examined several logs and debug output, and the only thing that appears to be off is the checksum errors that occur around the 20th of each month. Our experience with checksum errors is that when things go wrong we see plenty of checksum errors continuously until something fails, or we see the occasional checksum error which is usually the result of a bad spot on the disk. Alcanet's disks always seem to pass the disk hardware checks, and there does not appear to be anything else wrong. 5) As mentioned in (4) above, we have noticed a pattern that the checksums occur around the 20th of each month (give or take a day). Checksums have occurred from Jan 18-20, Feb 19, and March 21; is there anything else running during these times?
Is there a monthly job scheduled to run on this server? Thank You! 4/4/2001 10:04:43 AM foconnor Spoke to the reseller, who says the customer is not happy with the answers in the email above. They are concerned with data integrity. 4/6/2001 6:07:53 AM foconnor -----Original Message----- From: ICS Support [mailto:support@ics.de] Sent: Friday, April 06, 2001 4:26 AM To: support@concord.com Subject: TICKET CLOSURE #002961 43090 open,(46053 closed) Checksum failure in Ingres-Log ####################################################################### # # # I C S - N e t w o r k H e a l t h - T i c k e t # # ========================== # ####################################################################### Dear Concord Support, for our Customer Alcanet GmbH, Stuttgart we CLOSED the following Trouble Report: =============================================================== Checksum failure in Ingres-Log =============================================================== ICS REQUEST ID: 002961 YOUR TICKET / LOG ID: 43090 open,(46053 closed) CONTACT-DATA: ================= SUBMITTER: CSC CUSTOMER NAME: Alcanet GmbH TOWN: Stuttgart PLATFORM DATA: ============== HOST NAME: slsanh HOST ID: 634339352 IP ADDRESS: 149.204.45.43 VENDOR: HP OPERATING SYSTEM: HP-UX 11.0 64bit HARDWARE MODEL: 9000/800/N4000-36 MEMORY: 1 GB LICENSE DATA: ============= ICS CONTRACT ID: 000700 CUSTOMER/CONTRACT ID: 001263 VERSION: 4.7.1 INSTALLED PATCHES: Patch:P1 D02 Custom Scripts by ICS ADDITIONAL INFORMATION: Number of Elements: nhShowRev Network Health version: 4.7.1.D02 D0 - Patch Level: 01 ICS REQUEST DATA: ================= TYPE: Problem PRIORITY: 2 STATUS: Closed LONG DESCRIPTION: This ticket was closed yesterday, but the error re-occurred today: Our customer sees the following error messages in the Ingres error-log: -------------------------------------- SLSANH ::[49168 , 400e0d68]: Tue Nov 28 08:10:46 2000 E_SC0129_SERVER_UP Ingres Release OI 2.0/9712 (hp8.us5/00) Server -- Normal Startup.
SLSANH ::[49168 , 409d4120]: Sun Dec 3 03:00:47 2000 E_CLFE06_BS_WRITE_ERR Write to peer process failed; it may have exited. SLSANH ::[49168 , 40ac2040]: Tue Dec 5 14:01:14 2000 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats0_973508399, Page 10. SLSANH ::[49168 , 40ac2040]: Tue Dec 5 14:01:14 2000 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats0_973508399, Page 10. --------------------------------- The error message is then repeated about 20 times, exactly the same. The time when this message occurs is the time of the scheduled Rollup. What does this error message mean? What are the implications for the system's integrity? Can this be the beginning of a crash scenario? This request may be related to escalated call ticket 42641/11742. Therefore it is of the same importance and should be given the same high priority! SOLUTION: As there is nothing more that Concord can do for us, we will close this ticket for now. We will see if the situation improves after upgrading to Version 4.8; the customer is looking forward to using Oracle in future releases. --------------------------------------------------------------- ICS Support E-Mail : support@ics.de Phone : ++49 - 89 - 74 85 98 90 Fax : ++49 - 89 - 78 31 06 Web : http://www.ics.de ICS GmbH, Kistlerhofstrasse 111, 81379 Muenchen --------------------------------------------------------------- 4/6/2001 9:02:20 AM yzhang ticket closed 4/3/2001 9:42:57 AM cestep Customer was on 4.5.1 patch 11. Database keeps growing - currently 6 GB with only 2000 elements being polled. Had the customer reduce the amount of time to keep as-polled data from 30 days to 2 days. Checking the Dialog and Stats rollup logs shows that the rollups take a little over 20 seconds to complete, with no errors. Looking in errlog.log, we see very few errors - some from the SCF subsystem, from which it then recovers. They still have as-polled data from April 2000.
Had the customer load patch 14. Ran IndexDiag, which had corrupted output. Needed to have the customer turn off polling because he is running out of space. 4/3/2001 11:52:30 AM yzhang I looked at the information you posted on the escalation directory. Is nhiindexdlg a scheduled job, or did they run it from the command line? Can you have the customer do the following: 1) There are two stats0 tables which have duplicate entries; we need to clean them. Have the customer run the attached script just by typing the script name after sourcing nethealthrc.csh. Then run the stats rollup. 2) echo " select table_name, create_date, table_indexes, num_rows, storage_structure, number_pages, overflow_pages from iitables order by table_name ;\g" | sql $NH_RDBMS_NAME > iitable.out. Thanks Yulun 4/6/2001 11:38:20 AM cestep Issue resolved, leaving for Yulun to close this ticket. 4/6/2001 11:40:14 AM yzhang problem solved 4/9/2001 12:11:36 PM wburke database:nethealth table:nh_element pathname:/nh04/idb/ingres/data/default/nethealth filename:aaaaaald.t00 ITP0ZN13::[46081 , 0000001c]: Tue Mar 27 14:46:13 2001 E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. The lock list currently holds 700 logical locks, and the maximum number of locks allowed is 700. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit. ITP0ZN13::[46081 , 0000001c]: Tue Mar 27 14:46:13 2001 E_DM004B_LOCK_QUOTA_EXCEEDED Lock quota exceeded. ITP0ZN13::[46081 , 0000001c]: Tue Mar 27 14:49:19 2001 E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. The lock list currently holds 700 logical locks, and the maximum number of locks allowed is 700. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit.
ITP0ZN13::[46081 , 0000001c]: Tue Mar 27 14:49:19 2001 E_DM004B_LOCK_QUOTA_EXCEEDED Lock quota exceeded. ITP0ZN13::[46081 , 00001522]: Wed Apr 4 18:25:01 2001 E_CL0608_DI_BADEXTEND Error allocating disk space write() failed with operating system error 0 (Error 0) ITP0ZN13::[46081 , 00001522]: Wed Apr 4 18:25:01 2001 E_DM9000_BAD_FILE_ALLOCATE Disk file allocation error on database:nethealth table:nh_element pathname:/nh04/idb/ingres/data/default/nethealth filename:aaaaaald.t00 write() failed with operating system error 0 (Error 0) ITP0ZN13::[46081 , 00001522]: Wed Apr 4 18:25:01 2001 E_DM92CB_DM1P_ERROR_INFO An error occurred while using the Space Management Scheme on table: nh_element, database: nethealth ITP0ZN13::[46081 , 00001522]: Wed Apr 4 18:25:01 2001 E_DM92CF_DM2F_GALLOC_ERROR Error allocating space in physical Select count (*) from nh_element\g = 31,000 elements 4/9/2001 1:26:18 PM wburke Size of /nh04/idb/ingres/data/default/nethealth filename:aaaaaald.t00 = 2.17 GB 4/9/2001 3:58:47 PM yzhang Obtain the following: 1) echo "select count (*) from nh_element;\g" | sql $NH_RDBMS_NAME > nh_element_count.out 2) echo "select count (*) from nh_element where element_class = 2;\g" | sql $NH_RDBMS_NAME > nh_element_node.out 3) echo "help table nh_element;\g" | sql $NH_RDBMS_NAME > nh_element_help.out 4) echo " select file_name from iifile_info where table_name = 'nh_element';\g" | sql $NH_RDBMS_NAME > element_file_name.out. 5) Find out the file name from element_file_name.out, then do ls -l on the file name in $II_SYSTEM/ingres/data/default/nethealth. Tell us the file size.
6) echo "help;\g" | sql $NH_RDBMS_NAME > help.out 7) echo " select table_name, create_date, num_rows, number_pages, overflow_pages, storage_structure, is_compressed, allocation_size, allocated_pages from iitables;\g" | sql $NH_RDBMS_NAME > iitable.out 4/9/2001 6:12:03 PM yzhang Walter, have the customer do the following: 1) Run the attached btree_elemet_13501.sh to reconstruct the nh_element index and storage structure. The script has been tested. Make sure to source nethealthrc.csh before running the script. 2) I noticed that many stats0 tables contain duplicates (they may not be true duplicates; the fetched stats0 tables may have been indexed); in any case, they can run the attached nhcleandupstats.sh just by typing the script name. 3) After doing the above, they need to remove the whole remotePoller directory under $NH_HOME/db, and remove the file called .file_copied from the Remote.tdb directory in any of the remote sites. Then they can run nhFetchDb with the following command: sh -x nhFetchDb ...........(whatever their parameters) >& FetchDb.out. Send me FetchDb.out. Hope these work. Thanks Yulun 4/11/2001 3:41:05 PM wburke Fetch created duplicates with element IDs in both the 1,000,000 range and the 5,000,000 range. 4/13/2001 12:37:10 PM wburke The fetch was successful and they are up and polling. 4/20/2001 11:26:11 AM yzhang Do you know if there is any progress on this one? 4/20/2001 11:45:25 AM wburke Need to determine why nxt_hndl jumped from 1 million to 5 million during the middle of the fetch. 4/20/2001 11:45:45 AM wburke Yulun, Customer is back up and running. We need to determine why the HDL table changed nxt_hndl from 1 million to 5 million. 5/24/2001 2:35:45 PM yzhang Some testing is required to verify that this is a problem. 11/20/2001 11:41:34 AM yzhang Jose, Originally this was an nh_element-hitting-2GB problem. I noticed from your update that this has been resolved. If they are still concerned about the wrong entries in the HDL table, a new problem ticket should be created.
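The numbered diagnostic requests above all follow the same pattern: echo an Ingres SQL statement terminated by `\g` into the `sql` command and capture the output in a file. A minimal sketch of that pattern as a reusable helper (the `run_sql` name is an invention for illustration; without a live Ingres/eHealth instance only the statement files are generated, so the piped `sql` call is left commented out):

```shell
#!/bin/sh
# Hypothetical helper wrapping the repeated `echo "<query>;\g" | sql` pattern
# used throughout this ticket. The statement is kept in a .sql file for
# review; uncomment the `sql` line on a host with Ingres installed.
DB="${NH_RDBMS_NAME:-nethealth}"

run_sql () {
    query="$1"; out="$2"
    printf '%s;\\g\n' "$query" > "$out.sql"   # keep the statement for review
    # sql "$DB" < "$out.sql" > "$out"         # requires a live Ingres instance
}

run_sql "select count (*) from nh_element" nh_element_count.out
run_sql "select count (*) from nh_element where element_class = 2" nh_element_node.out
run_sql "help table nh_element" nh_element_help.out
run_sql "select file_name from iifile_info where table_name = 'nh_element'" element_file_name.out
```

The redirect-to-file convention mirrors the ticket's own style, so the resulting `.out` files can be attached to the escalation directory unchanged.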
Thanks Yulun 4/11/2001 10:33:55 AM jpoblete OS: Solaris 2.7 RAM: 2 GB Swap: 4 GB Procs: 2 @ 400 MHz Ref Nodes: 602325 Unref Nodes: 239209 Address Pairs: 3209242 Conversation Rollup fails with the following error: ----- Job started by Scheduler at '04/10/2001 00:05:01'. ----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (04/10/2001 00:05:03). Error: Sql Error occured during operation (E_US1264 The query has been aborted. (Tue Apr 10 07:36:58 2001) ). ----- Scheduled Job ended at '04/10/2001 12:11:46'. ----- The Ingres errlog.log shows the following messages for the referenced timestamp: NETOPS3 ::[33131 , 00000035]: Mon Apr 9 18:46:42 2001 E_GC0005_INV_ASSOC_ID Invalid association identifier NETOPS3 ::[33131 , 00000133]: Tue Apr 10 07:36:58 2001 E_QE0022_QUERY_ABORTED The query has been aborted. NETOPS3 ::[33131 , 00000133]: Tue Apr 10 07:36:58 2001 E_DM9059_TRAN_FORCE_ABORT The transaction (00003ACD, 3AD4B657) in database nethealth is about to be force aborted. Further messages may or may not follow describing the force abort in more detail. NETOPS3 ::[33131 , 00000133]: Tue Apr 10 11:55:49 2001 E_DM9059_TRAN_FORCE_ABORT The transaction (00003ACD, 3AD4B657) in database nethealth is about to be force aborted. Further messages may or may not follow describing the force abort in more detail. Have tried the following: - Collected advanced logging of the process; it creates a huge file, and after 7.5 hours of processing it just stops without error. The advanced logging traces are in the call ticket directory if needed. - Unlimited the stack size; still got the error: E_US1264 The query has been aborted. - Resized the Ingres transaction log to 2 GB; still got the error: E_US1264 The query has been aborted.
4/23/2001 9:24:59 AM lemmon Reassigned to Yulun Zhang 5/10/2001 11:32:18 AM jpoblete Yulun, As per the discussion we had yesterday, I have written a script to run the Conversation Rollups in debug with the option -now, specifying the date and time at which the rollup is supposed to stop processing. If we run nhiDialogRollup without any arguments, the rollup will run forever: if we start the rollup at 11:30 AM and look at the status of the process at 4:30 PM, it looks like it is trying to roll up data collected at 4:20 PM. If we set the option -now and specify a date and time at which the rollup has to finish, the rollup runs OK, without further incident. 5/15/2001 12:51:26 PM wburke Walter- I created a Scheduled Job and used a script Jose sent to perform the rollups. The job is completing now within 2 hours, a lot shorter than the 10-12 hours a few months ago. The job runs once per day. Should I continue to run this new job or should I enable the original Dialog Rollup job?
5/24/2001 2:37:57 PM yzhang Use the script Jose sent to you if the rollup hangs. 7/18/2001 5:05:02 PM yzhang Robin and Jay, These two problems look the same: the conversation rollup hangs, but the rollup runs OK if the customer runs it with the -now option (specifying the date and time it should roll up to) after resetting nh_run_step and nh_run_schedule and setting the following three environment variables: NH_DBG_OUTPATH; NH_DBG_OUTPATH="$NH_HOME/tmp" NH_UNREF_NODE_LIMIT; NH_UNREF_NODE_LIMIT=3 NH_POLL_DLG_BPM; NH_POLL_DLG_BPM=2500 Jose has placed all of these into the following script; both customers have now run the script, and they are up and running. The script covers most of the workarounds we normally recommend to customers for conversation rollup problems. My question is: are we going to incorporate some of the workaround into our conversation rollup source code, or just polish the attached script so that customers can run it when encountering the problem? Thanks Yulun 7/23/2001 10:00:41 AM yzhang Closed; this is the same issue as prob. 12692. 4/19/2001 1:36:24 PM jpoblete NH 4.5.1 P11 on Solaris. The customer's disk filled up because the Statistics Rollups have been failing for a while; the error is: Begin processing (04/19/2001 12:07:51 PM). Error: Append to table nh_stats1_981953999 failed, see the Ingres error log file for more information (E_CO0048 COPY: Copy terminated abnormally. 0 rows successfully copied because either all rows were duplicates or there was a disk full problem.). We got the min and max sample times and dropped 6 nh_stats0 tables in the range of the table nh_stats1_981953999, then re-ran the Statistics Rollup by hand; it fails appending to the next stats1 table. Collected the DB, system log and errlog.log. Discussed this with Yulun Zhang to see if this can be fixed within support; he said the DB team would need to fix this, since it's beyond support.
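The conversation-rollup workaround in the 7/18 entry above boils down to setting the three environment variables and then running the rollup with an explicit `-now` stop time. A sketch of that setup, using the values quoted in the entry (the `-now` timestamp is illustrative, and the commented invocation assumes a live eHealth install):

```shell
#!/bin/sh
# Sketch of the rollup workaround: set the three variables from the ticket,
# then run the rollup with an explicit stop time so it cannot chase newly
# collected data forever. NH_HOME default is an assumption for illustration.
NH_HOME="${NH_HOME:-/opt/nethealth}"

NH_DBG_OUTPATH="$NH_HOME/tmp";  export NH_DBG_OUTPATH
NH_UNREF_NODE_LIMIT=3;          export NH_UNREF_NODE_LIMIT
NH_POLL_DLG_BPM=2500;           export NH_POLL_DLG_BPM

# $NH_HOME/bin/sys/nhiDialogRollup -u "$NH_USER" -d "$NH_RDBMS_NAME" \
#     -now "07/18/2001 00:00:00"    # requires a live eHealth install
```

Bounding the run with `-now` is the key design point: without it the rollup keeps finding newer samples and never reaches a stopping condition, which matches the "runs forever" behavior described in the 5/10 entry.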
4/23/2001 9:27:32 AM lemmon Reassigned to Yulun Zhang 4/23/2001 11:00:13 AM jpoblete Yulun: I wrote a script to locate how many tables have problems; here: stats1 tables: 37 stats0 tables: 222 I'll e-mail you the file with the results. 4/23/2001 11:32:03 AM yzhang Thanks for your effort; I think the DB team will take care of this problem. Yulun 4/25/2001 12:50:03 PM yzhang The stats rollup succeeded, so have the customer run the clean_process_13775.sh script just by typing the script name, then run nhiRollupDb. Thanks Yulun 5/3/2001 1:09:36 PM yzhang problem solved, and ticket closed 4/23/2001 3:24:55 PM wburke - NT, 4.7.1 p01 - Lost all data on the NH server's disks. - Restored from a complete system backup - the Network Health service will not start. Database read inconsistent; the force was successful. - save always fails in the same place: Unloading table nh_mtf . . . Unloading table nh_address . . . Unl - Running VerifyDb on the table returned: S_DU04C0_CKING_CATALOGS VERIFYDB: beginning check of DBMS catalogs for database nethealth S_DU16CD_EXTRA_IIXPROTECT iixprotect contains a tuple with prouser = , protabbase = 10512, and tidp = 1027, but a corresponding tuple is missing from iiprotect. S_DU024E_DELETE_IIXPROT The recommended action is to delete the offending tuple from IIXPROTECT. S_DU16CD_EXTRA_IIXPROTECT iixprotect contains a tuple with prouser = , protabbase = 10512, and tidp = 1028, but a corresponding tuple is missing from iiprotect. S_DU024E_DELETE_IIXPROT The recommended action is to delete the offending tuple from IIXPROTECT. - 4/23/2001 3:34:33 PM yzhang Shut Ingres down, make a backup of the files in the ii_database location, restart Ingres and then run verifydb in run_interactive mode. - verifydb -mruninteractive -sdbname "" -odbms_catalogs 4/24/2001 12:24:05 PM wburke obtained info. 4/24/2001 8:45:07 PM yzhang Walter, verifydb in runinteractive mode takes an exclusive lock on the database.
So, while you are running that, please make sure that there are no connected sessions on the database. Run verifydb and send us the iivdb.log file again. Have the customer do this: login as ingres, start Ingres, ipm server_list -> select -> ingres - session; there should be no entries under the database column; if there are entries, do DBA -> delete. Then login as nhuser and run the verifydb again. Robin, correct me if something is not correct. Yulun 4/25/2001 8:07:42 PM yzhang This is the email sent to CA: We finally got the iivdb.log. Can you tell me how to delete the offending tuples from iixprotect, since these are not user tables? Thanks Yulun 4/26/2001 11:33:06 AM wburke -----Original Message----- From: bill.erickson@uniontrib.com [mailto:bill.erickson@uniontrib.com] Sent: Thursday, April 26, 2001 10:41 AM To: support@concord.com Subject: Ticket #48315 Walter, The nhsavedb hung at the same point it has in the past. Here is the save log: Begin processing (4/25/2001 05:49:00 PM). Copying relevant files (4/25/2001 05:49:01 PM). Unloading the data into the files, in directory: 'E:/nethealth/db/save/support.tdb/'. . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_daily_exceptions . . . Unloading table nh_daily_health . . . Unloading table nh_daily_symbol . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . .
Unloading table nh_hourly_health . . . Unloading table nh_hourly_volume . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . 4/30/2001 1:50:28 PM wzingher This was somehow reassigned to Lemmon. Any reason? 5/14/2001 9:54:48 AM yzhang sent the iivdb.log file to CA for help 5/15/2001 3:41:58 PM wburke Waiting for more info from the vendor. 5/17/2001 4:56:14 PM yzhang 1) Have them run: verifydb -mruninteractive -sdbname $NH_RDBMS_NAME -odbms_catalogs -u$ingres send me the iivdb.log 2) Find ~yzhang/scripts/prob_13843.sh, and have the customer run it as: prob_13843.sh > prob_13843.out The script will produce an nh_address.dat file; have them keep this file. Send me prob_13843.out. I want to see this file and iivdb.log before they run nhSaveDb. 5/21/2001 2:51:32 PM yzhang Walter, Since they don't have a good db save, I am worried that answering "Y" may cause trouble for their database. Have them press Control-C to exit the verifydb. Then run prob_13843.sh (I think you already sent it to them) as: prob_13843.sh > prob_13843.out The script will produce an nh_address.dat file; have them keep this file. Send me prob_13843.out. Thanks Yulun 5/21/2001 4:22:31 PM wburke I have prob_13843.sh running on the Nethealth server. It started at 12:22 PDT and is still running. Neither file (prob_13843.out/nh_address_test.dat) shows any size change (they are still 0 KB). I just wanted to let you know, since I am not sure how fast this procedure should show any progress. Bill- 5/21/2001 5:23:24 PM yzhang Walter, Find out if the nh_address table is accessible by doing: select count (*) from nh_address and/or help table nh_address. Then find out the table size, as: select file_name from iifile_info where table_name = 'nh_address'.
Go to $II_SYSTEM/ingres/data/def*/nethealth for the table size. If the table size is not big, direct them to copy the nh_address table to a file by running a single SQL statement (refer to prob_13843.sh for the command). Let me know when you reach this point. Thanks Yulun 5/22/2001 10:49:29 AM yzhang Bill, Can you run the attached script and send us the file named concord_addr.out, located in the directory where you run the script? I am sending this directly to you because Walter is not here yet. Thanks Yulun 5/22/2001 11:43:57 AM yzhang Bill, Thanks for the quick reply. I think you can run nhSaveDb now; let us know if you have any problems. Yulun 5/23/2001 3:20:12 PM yzhang Run the verifydb (in interactive mode); I sent the command last time. This time, when it reaches the point where it asks if you want to delete the offending tuple, answer "Y". Yulun 5/23/2001 4:49:36 PM wburke -----Original Message----- From: bill.erickson@uniontrib.com [mailto:bill.erickson@uniontrib.com] Sent: Wednesday, May 23, 2001 4:46 PM To: YZhang@concord.com; WBurke@concord.com Cc: bill.erickson@uniontrib.com Subject: RE: Ticket # 13843 I ran verifydb again as directed; here is the command line: verifydb -mruninteractive -sdbname nethealth -odbms_catalogs -uconcord When it asks if you want to delete the offending tuple it returns the following: Aborting because of error E_US083A3 line 1, Cursor 'xpro_tst_curs_1' does not have delete permission. Bill- 5/23/2001 5:50:58 PM yzhang Find out the following: 1) echo "select * from iidatabase\g" | sql iidbdb > iidbdb.out 2) echo " select table_name, num_rows, create_date from iitables order by table_name\g" | sql nethealth > iitable.out 3) echo "help \g" | sql nethealth > help.out 4) nhCreateDb test_db 5) nhSaveDb -p test_db.tdb -d test_db (see if he encounters the same problem) 6) Ask him what tables he wants (though he may not know the table names), or what the important data in the database is that he wants. Then I can write a script to copy the tables into files.
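The "copy the table to a file by running a single SQL statement" advice above maps onto Ingres' copy-table-into-file form. A hedged sketch (prob_13843.sh itself is not reproduced in the ticket, and the output file name is assumed; running the statement requires a live Ingres instance, so the `sql` call is left commented out):

```shell
#!/bin/sh
# Hypothetical one-statement export of nh_address, in the spirit of the
# advice above. The statement is written to a file for review; the actual
# export only works against a running Ingres installation.
DB="${NH_RDBMS_NAME:-nethealth}"
printf "copy table nh_address() into 'nh_address.dat' \\\\g\n" > export_nh_address.sql
# sql "$DB" < export_nh_address.sql   # requires a live Ingres instance
```

This mirrors the `copy table nh_element() from 'smt_b23'` usage earlier in the log, with the direction reversed (`into` rather than `from`) to dump the table's rows out to a flat file.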
Thanks 5/23/2001 7:44:15 PM wburke obtained info: on voyagerii/48000/48315/5-23-01 5/24/2001 8:52:11 AM yzhang Bill, The save.log you just attached is a save from an empty database (the test_db you created yesterday), right? Can you do: echo "select * from iidatabase\g" | sql iidbdb > new_iidbdb.out Also, can you tell me what kind of data in the database you want to keep? I am thinking of reconstructing your database. You can send it to me directly; I think Walter will not be here until noon. Thanks 5/25/2001 11:36:48 AM yzhang Bill, please do the following: 1) login as nhuser and source nethealthrc.csh 2) cd $NH_HOME 3) mkdir my_test 4) copy the attached script into the my_test directory 5) chmod 777 my_test 6) prob_13843.sh > prob_13843.out, and send us prob_13843.out. The script will reconstruct your database, and no data gets lost while running the script. It may take several hours to complete, depending on the size of your database. Thanks 5/25/2001 1:44:15 PM yzhang You are on NT, right? Please don't remove anything in the my_test directory, and look around to see if you can find the reload.ing file. Also cd to my_test, ls -l > my_test.out, and send this file. Thanks Yulun 5/29/2001 2:07:23 PM wburke Spoke with Yulun regarding this issue. The attempt to save the DB failed. The customer has a DB save from 11/00 (4.4). I believe we will re-install the application. 5/29/2001 3:07:44 PM yzhang Walter, find out from the customer what's in the following directory: \oping20\ingres\ckp\default\iidbdb Thanks Yulun 5/31/2001 11:36:11 AM wburke obtained debug, BAFS/48000/48281/5-31-01 6/1/2001 10:24:12 AM don Don, customer recreated the database. Closed call. This bug is closed 6/1/2001 11:11:35 AM yzhang closed 5/3/2001 11:39:54 AM wburke NT sp06 512 RAM NH 4.7.1 p02 d03 Issue: Subject: Re: call ticket #48281 - Each time he discovers a device, the server dies It makes absolutely no difference: switch, router, CSU, server... take your pick. Any time I try to discover a device, the discovery process is successful.
I save the discovery to the database, and on the next poll the server stops unexpectedly. Sometimes the Nethealth service has stopped, sometimes not, but a complete reboot is required to start the server again. Stopping and starting the service doesn't work. - Customer is able to duplicate at will. - Debug at BAFS/48000/48281/debug - nhiCfgServer - nhiDbServer 5/11/2001 11:53:14 AM wburke - Across multiple product functions: - Element Modification - Discovery saves. 5/15/2001 5:30:22 PM dshepard From the advanced logs provided, it appears the Db Server is crashing. I say that because I don't see a clean shutdown at the end, like I do in the CfgServer log. There is a Db Server request from the poller after each config change to read the new element info. That may explain the timing of the problem. Reassigning to the Db group. 5/15/2001 7:10:14 PM wburke From: Trei, Robin It sure looks like the db server is crashing. The times I've seen it crash like that (just when it is streaming a reply) were when the data structures were mismatched because of version problems. Please get the history of when patches 1 & 2 were applied, and whether anything unusual happened. Please get an ls -l of everything in $NH_HOME/bin/sys (or the NT equivalent -- I want size and date). Please modify the following file to turn on even more debugging: $NH_HOME/sys/debuglog.cfg and edit to get the following: program nhiDbServer { arguments "-Dall -Dt" } Then have the customer turn on advanced logging and recreate the situation; once the server has crashed, send the log to me and reset this back. Warning: this will make an even larger log file, so the customer should minimize the time we are logging. ----------------------------------------------------------- REQUESTED 5/16/2001 9:40:11 AM bhinkel Info requested, so status changed to MoreInfo.
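Laid out as it would sit in $NH_HOME/sys/debuglog.cfg, Robin's requested change is just a per-program block; the layout below is a best-guess reconstruction from the inline text (consult the 4.7.1 documentation for the exact file syntax):

```
program nhiDbServer
{
    arguments "-Dall -Dt"
}
```

As the later entries note, `-Dall` proved too heavy (100% CPU), so the flag set was narrowed on subsequent passes.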
5/16/2001 3:11:00 PM wburke: Could not obtain debug, as nhiDbServer with -Dall hangs the box at 100% CPU for this process. All other information obtained and is on BAFS/48000/48281.

5/17/2001 12:30:03 PM rtrei: Reviewed the list of processes. Here is when things were upgraded:

rwxrwxrwx 1 Administrators None 0 Sep 13 2000 4.7.1
drwxrwxrwx 1 Administrators None 0 Jan 18 14:05 4.7.1D3
drwxrwxrwx 1 Administrators None 0 Jan 18 13:21 4.7.1P2

Here are the dates for the key applications:

-rwxrwxrwa 1 Administrators None 2704896 Jul 20 2000 nhiCfgServer.exe
-rwxrwxrwa 1 Administrators None 2295296 Jul 20 2000 nhiDbServer.exe
-rwxrwxrwa 1 Administrators None 1828352 Jul 20 2000 nhiMsgServer.exe

Not only are they all the same, they are clearly from the original 4.7.1 release. In other words, everything was working fine for several months. I got the impression from the call ticket that this problem seemed to start in February. We need to pinpoint whether the customer feels it really started after the patch/certification were applied, or whether they remember some other event (such as changing OS parameters) that could have been the cause. (Some external event had to trigger the problem; software does not wake up one day and decide it is tired of working.) Meanwhile, we need to run nhiDbServer with some flags on. It will take us a few iterations to get it right, since we can't turn them all on. Here is what I suggest for the next pass: -Dmall -Df dtTz -Dt

5/17/2001 1:21:41 PM wburke: Requested info. Will complete by tomorrow AM.

5/17/2001 2:31:07 PM yzhang: I looked at some logs (including errlog.log) and the debug output from the escalated directory. I expected to see a core file, but there is no core file. Yes, this is a new problem to me, and I will definitely work closely with you.
In the errlog.log, the message "association failure, partner abruptly released association" appears repeatedly, so I wonder whether the DBMS server dying causes the NH server to die, or whether the NH server dies because of some problem in the socket communication. My plan for this problem is: 1) see if they have a core file I can start debugging with; 2) see if I can reproduce this problem on my system. Let me know your opinions. Thanks Yulun

5/18/2001 1:06:33 PM wburke: Obtained debug, BAFS/48000/48281/debug.

5/18/2001 4:45:48 PM yzhang: Thanks for the reply, but I need information regarding how to upgrade to 2.0/0011, and what the latest patch is. Thanks Yulun

5/18/2001 6:02:23 PM wburke: Please place the attached in $NH_HOME/sys. Make a copy of the original, then place the attached file in the directory. Stop and start the server. Set advanced logging for the Database. Crash the application. Restart. Send in $NH_HOME/log/advanced/nhiDbServer.txt.

5/21/2001 4:20:15 PM yzhang: The one you just sent is exactly the same as the one you sent on Friday. I think we need the one with more modules and flags (see the email I sent you Friday evening regarding the command). Thanks Yulun

5/22/2001 1:22:05 PM yzhang: The debuglog.cfg looks right, but the one I received yesterday (nhiDbServer.txt) is exactly the same as the one I received last Friday. Check to see if they sent you the wrong one, or they may need to run it again. Thanks Yulun

5/22/2001 1:44:38 PM wburke: Requested info.

5/22/2001 4:45:52 PM yzhang: The file nhiDbServer.txt you just forwarded is still exactly the same as the one I attached here (the one received before we asked for more modules and flags). Do you think they may have done something wrong in advanced debugging? If they did advanced debugging yesterday or today, the date should show in nhiDbServer.txt.
Thanks Yulun

5/23/2001 3:00:02 PM yzhang: Let's try running the command manually from a prompt using sh: 1) sh (become sh) 2) nhiDbServer -Dm dsvr:ccm:esd:cdm:cdt -Df cCdDtT -Dt > nhiDbServer10.out 2>&1. The command has been tested, and it worked. Have them do this quickly to see if it works. Thanks Yulun

5/24/2001 3:29:46 PM wburke: Cannot get the debug; it fails with an NT error.

5/25/2001 10:21:50 AM yzhang: Working on debugging nhiDbServer.

5/29/2001 11:30:23 AM yzhang: This sounds good. I will do some quick tests, then have the customer do it, and see if we can get the core file. Thanks Yulun

5/29/2001 1:47:09 PM yzhang: Walter, have the customer follow this procedure to collect the core file: 1) In the Control Panel -> System applet, set the system environment variable NUT_DUMP_CORE to yes. Stop and restart your nethealth servers. 2) Do a discovery, then start polling until it crashes. 3) Look for a file named "core." under $NH_HOME/bin/sys (it may be elsewhere) and send the core file. 4) Unset the NUT_DUMP_CORE variable. Thanks Yulun

5/29/2001 2:05:44 PM wburke: Requested info.

5/31/2001 1:22:42 PM yzhang: Walter, can you ask the customer to follow the procedure I sent to collect the core file? The new debugging output on nhiDbServer has a little more information regarding the scheduled jobs, but it crashed in the same place, where the db server was trying to send a message to the socket. Robin, you may be able to grab more useful information from the attached debugging output. Thanks

5/31/2001 1:41:39 PM yzhang: Stephen, the procedure for collecting a core file on NT originated from your email, but the customer said NUT_DUMP_CORE does not work; their db server crashed but no core was generated. I am wondering if you have any idea about this. Walter, provide more information on this if you have any.
Thanks Yulun

6/1/2001 1:09:37 PM wburke: Info provided.

6/1/2001 1:10:04 PM wburke: Sent NT event logs to YZhang.

6/5/2001 8:41:16 AM yzhang: Walter, I got some suggestions from Dave regarding this problem. Can you have the customer enable advanced logging for all the major processes as described below, so we can figure out which process is in question? Thanks Yulun

6/5/2001 11:24:19 AM wburke: Requested info, again.

6/6/2001 1:59:14 PM yzhang: Walter, this is our plan for handling this problem: we still want to collect the core file; last time's failure may be because the system was not rebooted after setting the system variable. We also need to collect advanced logging for the DbServer with the module and flags described below. If there are still not many clues from those, I will place some instrumentation in the db server code to trace the problem. Have the customer follow this procedure to collect the core file and produce the debug file, and make sure they reboot the system after setting the system variable: 1) In the Control Panel -> System applet, set the system environment variable NUT_DUMP_CORE to yes. Stop the nethealth servers and Ingres. 2) Reboot the system. 3) Turn on module -Dm dsvr with flags -Df dDzZ (you need to modify this in debuglog.cfg). 4) Do a discovery, then start polling until it crashes. 5) Look for a file named "core." under $NH_HOME/bin/sys (it may be elsewhere) and send the core file, along with the debug text. 6) Unset the NUT_DUMP_CORE variable. Thanks

6/6/2001 2:16:25 PM wburke: Requested info.

6/12/2001 11:09:04 AM yzhang: Did you get anything new from the customer on this one?

6/12/2001 12:48:30 PM wburke: Customer is going to upgrade to 4.8.

6/12/2001 1:13:15 PM yzhang: Have they done the upgrade, and does the same crash happen in NH 4.8? Hope you can keep track with them.

6/12/2001 2:33:09 PM wburke: Same.

6/13/2001 1:18:26 PM yzhang: Robin, the customer did upgrade from 4.7.1 to 4.8, but it crashes in the same place.
What do you think the next step should be: try to collect the core file, place some instrumentation in the db server code, or reproduce the crash in house? Thanks Yulun

6/15/2001 11:45:28 AM wburke: Obtaining debug.

6/15/2001 1:16:11 PM wburke: Customer went on vacation, with no backup, for a week.

6/15/2001 1:17:48 PM wburke: -----Original Message----- From: Burke, Walter Sent: Friday, June 15, 2001 1:07 PM To: 'rthomp32@csc.com' Subject: Ticket # 48281. Tim, I understand that you will be on vacation for the following week. Unfortunately, we need to obtain debug to fix your problem ASAP. Please let me know if there is another person whom I may contact during this period. Sincerely,

6/18/2001 5:28:04 PM don: Customer on vacation; de-escalated until he returns.

6/25/2001 4:21:14 PM wburke: Obtained info; NO DbServer debug was created.

6/28/2001 5:01:47 PM wburke: -----Original Message----- From: Burke, Walter Sent: Thursday, June 28, 2001 4:40 PM To: 'rthomp32@csc.com' Subject: Ticket # 48281. Tim, we have figured out why we cannot get nhiDbServer debug: the flags were set incorrectly. Close the console. cd $NH_HOME/log/advanced/ and remove all files. Replace the current $NH_HOME/sys/debugLog.cfg with the attached. << File: debugLog.cfg >> Bring up the console -> Setup -> Adv. Logging. Turn on advanced logging for the database, configuration, and discover. Crash the server using Discover. Restart the server. Send in $NH_HOME/log/advanced/nhiDbServer.txt, $NH_HOME/log/advanced/nhiCfgServer.txt, $NH_HOME/log/advanced/nhiDiscover.txt, and all files in $NH_HOME/tmp/nhiCfgServer. Turn off advanced logging.

6/29/2001 8:46:36 AM bhinkel: Walter sent email to the customer asking for info, so changed state.

7/2/2001 1:52:25 PM rtrei: Rupa -- see me if you have any questions; I will be in on Thursday.

7/3/2001 12:43:28 PM wburke: Obtained info. BAFS/48281/7-2-01.

7/5/2001 2:37:01 PM rnaik: What flags were turned on for the latest nhiDbServer.txt of 7/2/01? Can we get the customer to reproduce the problem
with the following flags for nhiDbServer: -Dmall -Df dDcCizZ. If -Dmall is not possible, at least module ccm should be specified, i.e. -Dm dsvr:ccm:esd:msvr:cu -Df dDcCizZ. NOTE: the module ccm and flags cCzZ are what I am looking for. So the sequence is: stop servers; modify $NH_HOME/log/debugLog.cfg to set the above module and flags for nhiDbServer; start servers; turn advanced logging on for Database and Messaging, i.e. nhiDbServer and nhiMsgServer. Also, how many elements are they trying to discover? It looks like they are reproducing this while discovering a single element; is that correct? Thanks, R.

7/5/2001 4:14:43 PM wburke: "-Dm cu:dsvr:ccm:dbg:tb -Df cdiOtZ -Dt". Yes, reproducing with only one element.

7/5/2001 4:27:21 PM rnaik: We need module ccm with flags cCzZ, i.e. -Dmall -Df dDcCizZ or -Dm dsvr:ccm:esd:msvr:cu -Df dDcCizZ. Make sure advanced logging is ALSO turned on for msgServer and cfgServer. Do you by any chance have access to memory requirements based on number of elements for 4.7? We should look at that to make sure we have enough memory. I've seen a somewhat similar customer issue with 4.7 (call# 42457) around Dec 2000, and when they upgraded memory the problem went away. Meanwhile, let's try to get nhiDbServer.txt generated with the above flags. Thanks! R.

7/9/2001 6:50:45 PM wburke: -----Original Message----- From: rthomp32@csc.com [mailto:rthomp32@csc.com] Sent: Monday, July 09, 2001 6:07 PM To: Burke, Walter Cc: cjh@concord.com Subject: RE: 48281. Yes, we had increased memory, but apparently not enough to compensate for the expanding database. I can't believe that we have wasted so much time and overlooked this factor. You had all of the configuration information at your disposal.

7/10/2001 10:46:50 AM rnaik: Memory upgrade fixed the problem. Customer increased memory from 384 MB to 512 MB (replaced a 128 chip with a 256). They had 2017 elements, 384 MB RAM, and 512 MB swap.
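For reference, the NUT_DUMP_CORE core-collection procedure attempted earlier in this thread can be sketched as a small shell sequence. This is a hedged sketch: NUT_DUMP_CORE and the "core." filename come from the ticket; the fallback NH_HOME path is illustrative, the server restart is left as a comment, and on NT the variable was actually set via the Control Panel rather than a shell.

```shell
#!/bin/sh
# Sketch of the core-collection steps from the ticket (Unix form).
NH_HOME="${NH_HOME:-/tmp/nh_home}"   # illustrative fallback path
mkdir -p "$NH_HOME/bin/sys"
NUT_DUMP_CORE=yes
export NUT_DUMP_CORE                 # per the ticket: set, then restart the servers
# ...restart the nethealth servers, discover, and poll until the crash...
# The core may land outside bin/sys, so search the whole tree:
find "$NH_HOME" -name 'core.*' -print > /tmp/core_files.txt
unset NUT_DUMP_CORE                  # final step: clear the variable again
echo "core search complete"
```

As the thread shows, the variable only takes effect if the servers are restarted (and on NT, the box rebooted) after it is set.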
5/4/2001 11:28:56 AM wburke: Original error: SQL error occurred during operation (E_QE007D) - error trying to put a record. Research: NETHEALT::[32796, 00007a68]: Mon Apr 30 10:05:54 2001 E_DM9000_BAD_FILE_ALLOCATE Disk file allocation error on database: nethealth, table: nh_element, pathname: /nethealthda/idb/ingres/data/default/nethealth, filename: aaaaeeaj.t00. write() failed with operating system error 27 (File too large). Attempt: tried to copy the nh_element table out and re-create it; the transaction log filled in the middle of the script. Files on BAFS/49000/49107. SCRIPT:

echo "help table nh_element\g" | sql nethealth > nhElem.out
echo "copy table nh_element() into 'nhElement.dat' \g" | sql nethealth >> nhElem.out
echo "create table ref_node as select * from nh_element where element_class < 3\g" | sql nethealth >> nhElem.out
echo "drop table nh_element\g" | sql nethealth >> nhElem.out
echo "create table nh_element as select * from ref_node\g" | sql nethealth >> nhElem.out
echo "create unique index nh_element_ix1 on nh_element ( element_id) with structure = btree\g" | sql nethealth >> nhElem.out
echo "create unique index nh_element_ix2 on nh_element ( element_class, name) with structure = btree\g" | sql nethealth >> nhElem.out
echo "help table nh_element\g" | sql nethealth >> nhElem.out

nhElem.out: * Executing . . . E_US1262 Your transaction has been aborted due to the transaction log file having reached one of the limits set by the system administrator. These limits are log_full, force_abort, and 90 percent of force_abort when using the fast_commit option to start DBMS servers. (Thu May 3 21:22:17 2001)

5/4/2001 4:07:41 PM yzhang: This is the query for creating the nh_element table; you need to add the two unique indexes after creating it. You probably want to test on your system before it goes to the customer. Let me know if you have any questions.
5/7/2001 1:08:08 PM jpoblete: Yulun, as per our talk on 05/07/2001, I'm waiting for the script to rebuild the nh_element table and reload the data from nhElement.dat.

5/7/2001 2:26:29 PM yzhang: Jose, attached is the script we talked about this morning. You need to do some work with it: 1) find out from the customer the absolute directory of the nh_element backup file, and place it into the script; 2) test it with a very small database; 3) send it to the customer. Thanks Yulun

5/8/2001 4:48:35 PM yzhang: Have the customer run this script with the command nh_elem_14123.sh > nh_elem_14123.out, and send us nh_elem_14123.out. We will direct them what to do next based on nh_elem_14123.out. The script assumes they have loaded the nh_element table without indexes. Also tell the customer to keep their data file for the nh_element table; that is the backup in case something goes wrong. Thanks Yulun

5/9/2001 2:28:52 PM jpoblete: Yulun, the script you sent them ran OK; they do not have unreferenced nodes anymore, but the outcome is no good... The file size is 2138939392 bytes => 1.99121 GB.

INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed May 9 13:53:48 2001 continue * Executing . . .

+-------------+-------------+
|element_class|col2         |
+-------------+-------------+
|            1|        27611|
|            2|      2061038|
+-------------+-------------+
(2 rows)

continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed May 9 13:54:32 2001

INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed May 9 13:58:54 2001 continue * Executing . . .

+-------------+
|col1         |
+-------------+
|      4230879|
+-------------+
(1 row)

continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed May 9 13:59:06 2001

INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc.
Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed May 9 14:00:36 2001 continue * * Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed May 9 14:00:36 2001

INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed May 9 14:01:37 2001 continue * Executing . . .

Name: nh_element
Owner: nhuser
Created: 07-may-2001 18:01:08
Location: ii_database
Type: user table
Version: OI2.0
Page size: 2048
Cache priority: 0
Alter table version: 0
Alter table totwidth: 935
Row width: 935
Number of rows: 2088649
Storage structure: heap
Compression: none
Duplicate Rows: allowed
Number of pages: 1044393
Overflow data pages: 1044325
Journaling: enabled after the next checkpoint
Base table for view: no
Permissions: none
Integrities: none
Optimizer statistics: none

Column Information:
Column Name       Type     Length  Nulls  Defaults  Key Seq
element_id        integer       4  no     no
name              varchar      64  no     no
element_class     integer       4  no     no
element_type      integer       4  no     no
create_dt         integer       4  no     no
speed             float         8  no     yes
speed1            float         8  no     yes
ip_address        varchar      21  no     yes
mtf_name          varchar      64  no     yes
index1            integer       4  no     yes
index2            varchar      64  no     yes
index3            varchar      64  no     yes
index4            varchar      64  no     yes
device_speed      float         8  no     yes
device_speed2     float         8  no     yes
community_string  varchar     256  no     yes
poll_rate         integer       4  no     yes
store_in_db       integer       4  no     yes
unique_dev_id     varchar     128  no     yes
nms_key           varchar     128  no     yes
nms_state         integer       4  no     yes

Secondary indexes:
Index Name      Structure  Keyed On
nh_element_ix1  btree      element_id
nh_element_ix2  btree      element_class, name

continue * Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed May 9 14:01:38 2001

5/9/2001 3:26:39 PM yzhang: Asked support to have the customer run cleannodes_pwc.sh and cleannode_check.sh to reduce the unreferenced nodes.

5/9/2001 7:14:07 PM jpoblete: Yulun, we ran the scripts, and when they finished, all the nodes were wiped out. We then started the Nethealth processes and everything came up just
fine. We will keep you posted if something comes up.

5/14/2001 11:46:29 AM bhinkel: Just got escalated, so updated the status back to Assigned.

5/14/2001 2:22:44 PM yzhang: Run this script and send me the output. Also send me the conversation rollup log.

5/15/2001 11:43:05 AM cestep: Customer ran the cleanNode_check.sh script, but the cleannode_check.out file seems to have no relevant information; it is on \\BAFS under ticket #49107. His rollup got hung up last night and is still running.

5/15/2001 11:51:12 AM cestep: Customer resent results; there is information there.

5/15/2001 12:58:27 PM bhinkel: Info provided; changed to Assigned.

5/15/2001 5:22:16 PM rkeville: Customer will put nh_element.dat on the ftp server.

5/15/2001 5:50:58 PM rkeville: -----Original Message----- From: Keville, Bob Sent: Tuesday, May 15, 2001 5:42 PM To: Zhang, Yulun Subject: 14123. Yulun, the file nh_element.dat was tarred to create the file 49107_nh_element.tar, which was ftp'd to ftp.concord.com/incoming in binary mode. Please advise on how to proceed.

5/16/2001 9:48:44 AM yzhang: Have the customer run the attached script using the command nh_elem_14123_1.sh > nh_elem_14123_1.out, and send me nh_elem_14123_1.out. Before they run the script, they need to make sure they have nh_element.dat (the one you sent to me) placed in the directory where they run the script. The script has been tested. Thanks Yulun

5/16/2001 11:32:52 AM cestep: Customer has run the script and sent nh_elem_14123_1.out. It is on \\BAFS under ticket #49055.

5/16/2001 11:33:11 AM cestep: Wrong ticket number; it's 49107.

5/16/2001 2:43:21 PM cestep: After running the script, the customer received the following error in the console: Wednesday, 05/16/2001 08:08:27 AM Internal Error (Configuration Server) Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2507). Then the server stopped and restarted.
5/16/2001 4:55:39 PM yzhang: I talked with the customer. He said he is not worried about losing conversation data, but he wants to make sure the stats data is kept and the stats poller works properly. Have him do the following: 1) save the current database (make sure the DB save succeeds); 2) run the attached script just by typing the script name, and send me the concord_14123.out file.

5/21/2001 1:31:18 PM cestep: Customer sent back the output file. It's on \\BAFS under ticket #49107.

5/21/2001 2:13:50 PM yzhang: Their output indicated that they have duplicates in the nh_element table; we need to take care of this first. The attached script will report the duplicates in the tables (including the nh_address table). Have them run the script, then send me the concord.out from $NH_HOME/tmp. They can run the script in 10 to 15 minutes. Don't do anything until the duplicates are cleaned. Thanks

5/22/2001 2:11:29 PM yzhang: The output is almost empty. Did they get the concord.out from $NH_HOME/tmp? Have them check this; the concord.out should be in $NH_HOME/tmp. Or rerun the script. Thanks Yulun

5/22/2001 4:04:41 PM cestep: Customer verified that he retrieved the file from $NH_HOME/tmp. He wants to know what the next step is.

5/23/2001 11:43:06 AM cestep: Had the customer turn the server on last night. The rollup ran and appeared successful, but the Conversations Rollup log is empty. The system log indicates that the job ran and finished, but it ran the job numerous times, even though it's only scheduled to run once. Need to find out if the rollups are actually successful at this point. I can request another "help\g" output from the database. Please advise.

5/23/2001 12:06:51 PM yzhang: Did they schedule the conversation rollup every four hours? Do the following: 1) get all scheduled jobs (nhschedule -list; check the book on the command); 2) run nhDbStatus and send the output; 3) disable the scheduled conversation rollup, run the conversation rollup manually, and send the log file.
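The three diagnostics requested in the last entry can be gathered with a small wrapper. This is a hedged sketch: nhschedule, nhDbStatus, and the manual rollup invocation are the commands named in this log; they are left commented out because they only exist on a Network Health system, so only the output-directory handling runs as-is, and the /tmp/nh_diag path is illustrative.

```shell
#!/bin/sh
# Sketch: collect the rollup diagnostics requested in the ticket.
OUT="${OUT:-/tmp/nh_diag}"            # illustrative output directory
mkdir -p "$OUT"
# nhschedule -list  > "$OUT/schedule.txt"   # 1) all scheduled jobs
# nhDbStatus        > "$OUT/dbstatus.txt"   # 2) database status
# Disable the scheduled conversation rollup first, then run it manually
# and keep its log, e.g.:
# nhiRollupDb -u -d > "$OUT/rollup.log" 2>&1
echo "diagnostics directory ready: $OUT"
```

Bundling the outputs in one directory makes it easy to send everything to engineering in a single transfer, as this thread repeatedly requires.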
5/24/2001 7:55:57 AM cestep: After disabling the scheduled rollup, the customer realized that it was set to run every 4 hours. Here is the output of nhDbStatus:

Database Name: nethealth
Database Size: 23303356416.00 bytes
RDBMS Version: OI 2.0/9712 (su4.us5/00)

+-------------------+----------------------+---------------------------------+
| Location Name     | Free Space           | Path                            |
+-------------------+----------------------+---------------------------------+
| ii_database       | 45872938000.00 bytes | /nethealthda/idb                |
+-------------------+----------------------+---------------------------------+

Statistics Data:
Number of Elements: 27886
Database Size: 21354635264.00 bytes
Location(s): ii_database
Latest Entry: 05/23/2001 02:19:46 PM
Earliest Entry: 04/25/2000 12:00:00 AM
Last Roll up: 05/22/2001 11:47:36 PM

Conversations Data:
Number of Probes: 4
Number of Nodes: 52968

As Polled:
Database Size: 46563328.00 bytes
Location(s): ii_database
Latest Entry: 05/23/2001 02:00:00 PM
Earliest Entry: 05/21/2001 10:06:54 AM
Last Roll up: 05/23/2001 12:06:04 PM

Rolled up Conversations:
Database Size: 113287168.00 bytes
Location(s): ii_database
Latest Entry: 05/18/2001 12:30:00 PM
Earliest Entry: 04/08/2001 12:00:00 AM

Rolled up Top Conversations:
Database Size: 662396928.00 bytes
Location(s): ii_database
Latest Entry: 05/18/2001 12:30:00 PM
Earliest Entry: 05/21/2000 12:00:00 AM

The customer will run the conversations rollup manually and send the results.

5/24/2001 12:09:38 PM yzhang: I checked with the customer; the manual conversation rollup completed successfully. The reason we saw multiple rollups in the syslog is that they scheduled the conversation rollup every four hours, and as soon as the DB backup was done, all of the waiting rollups started. Currently they are up and running and don't have any outstanding problem. I suggest de-escalating this ticket. Thanks Yulun

5/31/2001 1:08:10 PM cestep: Customer verified that rollups are running correctly. Everything appears to be running fine now.
6/1/2001 2:53:19 PM don: Rollups running again; closing bug.

5/9/2001 12:59:34 PM mgenest: Request for a GUI meter to be included for the database conversion during upgrade, so the customer can know the progress of the conversion.

9/1/2001 3:19:16 AM AR_ESCALATOR: Administrative change. This ticket has been created as an Enhancement Request.

5/9/2001 2:10:55 PM mmcnally: Someone ran a network time utility that put their system into August 2001. - Informed the customer this is not supported, but we would try to fix it for them. - They do not have a DB save from before this problem occurred. The time on the server was changed to 08/05/2001; when it was changed back, the customer was receiving the following error: Collection time for 'AFSDERT1.SKANDIA.DE-S1/3-dlci-5-760' is earlier than the last time it was imported on a previous poll. We took the customer's database, loaded it on NH 4.7.1 P02 D03, and discovered a few elements; we have no problem polling them. The customer created a new "test" database and is still receiving the error.

5/16/2001 2:12:14 PM cestep: We sent the customer a script to truncate the NH_STATS_POLL_INFO and NH_IMPORT_POLL_INFO tables. He did the following: stop Nethealth and quit the console; run the script; stop Ingres; start Ingres; start Nethealth. No more polling errors.

5/10/2001 7:47:53 AM foconnor: After the upgrade to 4.8, the Ingres DB does not finish startup after NT (warm/cold) is rebooted. If you start the DB and NH manually, everything comes up OK. Only when NT is rebooted does the Ingres startup hang. This is a Windows NT system with the German language installed. Bug submitted for customer-sensitivity reasons. When NT (SP5) is rebooted, the start of the Ingres service does not finish. In the event log the following error messages show up (translated from German): The description for Event ID (2003) in Source (Ingres) could not be found.
It contains the following insertion string(s): This product/program is licensed to: CONCORD COMMUNICATIONS INC Site ID: 0167267. - The replication of the licensing information failed because no connection to the license protocol service on server \\D100CI0 could be established. - The description of Event ID (2003) in Source (Ingres) could not be found. It contains the following insertion string(s): This product is licensed to CONCORD COMMUNICATIONS INC Site ID 0167267.

Ingres account: -- ingres is a member of the Administrators group and has a password which never expires -- has the Log on as a Service right -- has the Log on as a Batch Job right -- has the Act as Part of the Operating System right.

From winmsd output (excerpt, translated from German):

Processor list:
0: x86 Family 6 Model 7 Stepping 3 GenuineIntel ~499 MHz
1: x86 Family 6 Model 7 Stepping 3 GenuineIntel ~499 MHz
2: x86 Family 6 Model 7 Stepping 3 GenuineIntel ~499 MHz
3: x86 Family 6 Model 7 Stepping 3 GenuineIntel ~499 MHz

Real memory (KB): total 2,096,544; available 1,557,032 (file cache)
C:\pagefile.sys: total 2,108,416; in use 74,516; peak 232,208
S:\pagefile.sys: total 4,193,280; in use 74,192; peak 232,240

The variable NH_DB_VERSION in settings.txt shows the value "Reporting 4.7:10". The setting NH_DB_VERSION was changed to NH_DB_VERSION=Reporting 4.8:11. The install logs look OK and the Ingres logs look OK. The customer had NH 4.7.1 previously and had no issues with Ingres. Files can be found at //BAFS/Escalated tickets/47000/47007.

6/21/2001 12:47:36 PM rtrei: Yulun -- this is a lower priority than your beta bugs.

7/27/2001 9:04:12 AM yzhang: Sheldon, can you collect everything in the CA_LIC and $NH_HOME/tmp directories? CA_LIC should be located on the same drive as Oping20.
Thanks Yulun

8/23/2001 11:32:36 AM yzhang: Closed due to no response from the customer.

5/18/2001 1:37:56 PM rrick: Problem: the database is growing too fast. 90% of the data is conversations info, and rollups are not freeing up disk space very well. Environment: 2 Fast Ethernet probes and 2 sniffers; 5 probes currently being polled; 700,855 nodes; As Polled conversations data set to 2 days; rollup data has been set below default; 11 GB database size, 90% of it conversations data. Tried: the following procedure sets the environment variable NH_UNREF_NODE_LIMIT from 30 days (the default) down to 20 days, to fool the Dialog poller into believing that it is the first run of the day. NOTE: this variable prevents the second-class-level conversations tables in the database from filling up with data older than the specified number of days, moving that data down to the third-class-level tables so it does not populate the Nethealth console (Conversations poller) when it comes up. Default = 30. How long do you need to see this data? Set the variable according to how long you need the data; a value of 20 rolls all data older than 20 days off the database. If we need to change this variable to 15 or even 10 days to roll off more data, we can.

-----------------------------------------------------------------------------------------------------------------------

Instructions: please perform the following:

1. Download or FTP the attached "resetRunStep.sh" script to the $NH_HOME directory. NOTE: if you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format. NOTE: this file is NOT in zipped-up or compressed format.

2. Log in to the Network Health system as the nethealth user.

3.
Change the following Nethealth environment variable in $NH_HOME/nethealthrc.csh, $NH_HOME/nethealthrc.sh, and $NH_HOME/nethealthrc.ksh (the .ksh file only if you have one and are running Network Health in the Korn shell):

C-shell config: setenv NH_UNREF_NODE_LIMIT 20
Bourne & Korn shell config: NH_UNREF_NODE_LIMIT=20; export NH_UNREF_NODE_LIMIT

NOTE: it is not recommended to change the environment variables in the $NH_HOME/nethealthrc.csh, .sh, and .ksh files directly. What is recommended is to copy the relevant lines from those files into $NH_HOME/nethealthrc.csh.usr, $NH_HOME/nethealthrc.sh.usr, and $NH_HOME/nethealthrc.ksh.usr (again, the .ksh.usr file only if you are running the Korn shell) and change the values there. The .usr files override the environment variable values in the corresponding nethealthrc files.

4. Source the $NH_HOME/nethealthrc.csh, .sh, or .ksh file.

5. Execute the "resetRunStep.sh" script in the $NH_HOME directory: ./resetRunStep.sh. NOTE: this puts two files in the $NH_HOME/tmp directory, run_step.dat and run_schedule.dat; please check them for errors.

6. Bring up the Network Health server and console.

7. Execute a manual roll-up from the command line in the $NH_HOME/bin/sys directory: nhiRollupDb -u -d

8. Does this free up space in the database?

Issue: when I had this customer try the above procedure to fool the rollups into believing they were being run for the first time that day, he received a hidden rollup failure. Yulun suggested getting the following info:
1) Send the rollup log file. 2) Find out their transaction log size. 3) Send errlog.log. 4) Send the output of nhDbStatus. 5) Send the syslog.

5/21/2001 10:17:58 AM rrick: -----Original Message----- From: Novak Robert K CONT DLVA [mailto:NovakRK@nswc.navy.mil] Sent: Friday, May 18, 2001 1:34 PM To: 'Rick, Russell' Subject: RE: Call Ticket #48503. Russell, I have a question. Even though the roll-ups are not working, the data being stored in the database should be valid data? We are down to 1.24 GB. If we need to remove the probes from the conversation poller and clear the conversation side of the database, what is the best way? I am worried that once I remove the probes, the database size may not change. Can the conversations stored in the database be removed without removing statistical data? Thanks, Robert

-----Original Message----- From: Rick, Russell [mailto:RRick@concord.com] Sent: Friday, May 18, 2001 3:03 PM To: 'Novak Robert K CONT DLVA' Subject: RE: Call Ticket #48503. Robert, if you are going to go that route, do you want to keep any history on those probe elements you have already collected? Russell K. Rick

-----Original Message----- From: Novak Robert K CONT DLVA [mailto:NovakRK@nswc.navy.mil] Sent: Friday, May 18, 2001 3:41 PM To: 'Rick, Russell' Subject: RE: Call Ticket #48503. No, we are not concerned with losing conversation data. What we need is a couple of weeks of data so that we may baseline a new web cache. We are generating reports on Mondays for a week. If we can get that report on Monday and then clear out the database, we will have the necessary time to complete our test as well as keep a consistent report. If you have the time to answer this or give me a call today, I would appreciate it.
Thanks, Robert Novak T11/NCI Network Operations Naval Surface Warfare Center (540) 653-7178 novakrk@nswc.navy.mil 5/21/2001 2:56:17 PM rrick -----Original Message----- From: Rick, Russell Sent: Monday, May 21, 2001 2:47 PM To: 'novakrk@nswc.navy.mil' Subject: RE: Call Ticket #48503 & Problem Ticket #14519 Hi Robert, Comments: This procedure will save off a full copy of the database, then it initializes all TA-related tables in the database, truncates all the configuration tables, and drops all the data tables. Instructions: 1. Please make sure you have a good full Network Health database backup created by the nhSaveDb utility before performing the rest of this set of instructions. Not a Checkpoint save!!!! nhSaveDb -p path -u user database name NOTE: PLEASE CHECK nhSaveDb LOGS BEFORE GOING FORWARD!!!!!! 2. Please download or FTP the attached "dropDlg.sh" script to the $NH_HOME directory. NOTE: If you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format. NOTE: This file is NOT in zipped-up or compressed format. 3. Log in as the nethealth user. 4. Update the environment by performing the following from the command line in the $NH_HOME directory: source $NH_HOME/nethealthrc.csh 5. Execute the "dropDlg.sh" script from the command line in the $NH_HOME directory. 6. If you still experience any issues, please contact support@concord.com, attn: Russ Rick. Regards, Russell K. Rick 5/23/2001 9:55:07 AM rrick -----Original Message----- From: Novak Robert K CONT DLVA [mailto:NovakRK@nswc.navy.mil] Sent: Tuesday, May 22, 2001 4:16 PM To: 'Rick, Russell' Subject: RE: Call Ticket #48503 & Problem Ticket #14519 Russell, The script worked great! No longer do we see the 1969 for the earliest entry under the database status. I am assuming that we still have a roll-up problem, but this has bought us some time.
Thanks again, Robert Spoke with Robert: - Since he does not have any more dlg data, there is no rollup problem. - Please collect new data for a few days and then try rolling up. - Please check the rollup logs and let me know if there is a problem. 5/31/2001 1:09:49 PM rrick -----Original Message----- From: Novak Robert K CONT DLVA [mailto:NovakRK@nswc.navy.mil] Sent: Thursday, May 31, 2001 10:35 AM To: 'Rick, Russell' Subject: RE: Call Ticket #48503 & Problem Ticket #14519 Russell, This morning, after checking the database status, the rollup had done its job. I guess that I do not understand the rollup. I have 'as polled' set to two days. I would have thought that every two days that data should roll up and show on the status window. What I guess I need to do is monitor for the next couple of weeks and get a better understanding of what is happening. If the rollup is working properly, I would like to comment out the variables (NH_UNREF_NODE_LIMIT, NH_POLL_DLG_BPM) that I placed in nethealthrc.sh.usr. My goal is not only to restore the default settings, but to remove the failure on the Database Status window. This will prevent future headaches with the customer. If this causes an adverse effect, I will uncomment them. We have 11G of drive space left. If a large amount of drive space can be maintained, I will enlarge the transaction log to 2G. I do not want this to be a never-ending trouble ticket, but as far as the rollup settings go, is there any way to modify them further? For example, Roll up Conversations has 4 Hour Samples, 4 days by default. The 4 days can be modified. Can the 4 Hour Samples be modified to, say, 1 Hour Samples? Please let me know what your thoughts are. If the above statements make absolutely no sense, please feel free to let me know. Thank you, Robert Spoke with Robert: - Cannot set rollups under 4 hr. - Everything is working OK now.
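The nethealthrc override mechanism used in this ticket (copy lines into the .usr file rather than editing the shipped rc file, so the .usr value wins) can be sketched as follows. This is a minimal illustration, not the product's actual startup logic: a scratch directory stands in for $NH_HOME, and the values are made up.

```shell
# Sketch of the nethealthrc .usr override pattern (Bourne shell).
# A mktemp scratch directory stands in for a real $NH_HOME.
NH_HOME=$(mktemp -d)

# Shipped defaults -- in a real install, leave this file untouched.
cat > "$NH_HOME/nethealthrc.sh" <<'EOF'
NH_UNREF_NODE_LIMIT=100
export NH_UNREF_NODE_LIMIT
EOF

# Site-local override: lines copied from the shipped file, value changed.
cat > "$NH_HOME/nethealthrc.sh.usr" <<'EOF'
NH_UNREF_NODE_LIMIT=20
export NH_UNREF_NODE_LIMIT
EOF

# Sourcing the .usr file after the shipped file lets its values win,
# which is why edits belong in the .usr copy.
. "$NH_HOME/nethealthrc.sh"
. "$NH_HOME/nethealthrc.sh.usr"
echo "NH_UNREF_NODE_LIMIT=$NH_UNREF_NODE_LIMIT"
```

Commenting a line out of the .usr file (as Robert proposes above) simply lets the shipped default stand again on the next source.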
5/18/2001 6:58:01 PM cpaschal The database was saved on NH 4.7.1 p2/d3 running on HPUX 10.20. This db save was loaded onto a fresh installation of NH 4.7.1 (with no patches) on HPUX 10.20. The load succeeded, but had errors: Non-Fatal database error on object: NH_DAILY_SYMBOL 25-Apr-2001 18:49:08 - Database error: -39100, E_QE0083 Error modifying a table. (Wed Apr 25 18:49:08 2001) errlog.log showed: FF01U86 ::[1047 , 40af6680]: Mon Apr 30 01:00:26 2001 E_DM005D_TABLE_ACCESS_CONFLICT Table access conflict. FF01U86 ::[1047 , 40af6680]: Mon Apr 30 01:00:26 2001 E_PS0D20_QEF_ERROR An error occurred calling QEF while destroying view or table. FF01U86 ::[1047 , 40af6680]: Mon Apr 30 01:00:26 2001 E_SC0215_PSF_ERROR Error returned by PSF. FF01U86 ::[1047 , 40af6680]: Mon Apr 30 01:00:26 2001 E_PS0007_INT_OTHER_FAC_ERR PSF detected an internal error when calling other facility. FF01U86 ::[1047 , 40af6680]: Mon Apr 30 01:00:26 2001 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log This appeared to be a problem with a table hitting the 2 GB table limit on: -rwx------ 1 ingres users 2025955328 May 2 00:56 aaaaaddn.t00 Customer started with a 9 GB database and 2 service profiles. The rollup schedule was reduced as follows: as polled: 6 days; hourly: 105 weeks reduced to 52 weeks; daily: 105 weeks reduced to 52 weeks. After running rollups from 11/1999 through 5/2001, database size decreased to 6 GB. However, the table size this time was: -rwx------ 1 ingres users 2052431872 May 15 00:42 aaaaaddn.t00 AE Steve McKellar sent in nhCollectCustData and nhiIndexDiag output. These have been saved to \\bafs\escalated tickets\49000\49040\may18 5/21/2001 12:36:25 PM yzhang The following message in load.log can be ignored, because no table nh_stats0_990140399 existed in the save directory.
Error: Uncompress of file /pr01uc26/app/nh/daily.tdb/nh_stats0_990140399 failed. But their nh_daily_symbol table is reaching 2 GB (currently it is about 1.7 GB), and nh_daily_health is 1.2 GB, so get the following: 1) echo "help table nh_daily_symbol\g" | sql $NH_RDBMS_NAME > daily_symbol.out 2) find out how long they kept the DA data; if they kept it too long, have them reduce it (check on the db worksheet about this) 3) check to see if they are willing to upgrade to nh48, which has a better mechanism for handling big DAC tables. Thanks 5/22/2001 10:59:44 AM yzhang Christine, How is the customer doing on this one? I requested some information yesterday; did they get the information for you? Thanks Yulun 5/23/2001 4:17:10 PM yzhang Check with the customer to see what their problem is now. If their system is running with only minor problems, this ticket should be de-escalated. Thanks Yulun 5/23/2001 4:22:03 PM cpaschal I just left voice mail for the AE working this issue to find out if the customer's system is working. I will let you know what I find out. Thanks for all your help, Chris 5/24/2001 9:01:25 AM yzhang Problem solved after adding disk space 5/29/2001 10:44:57 AM foconnor Data Analysis is failing, complaining that the nh_daily_exceptions table is a heap and not a btree. The dataAnalysis has been failing for weeks. Recently the nhCleanDupStats.sh script was run, along with a script to modify the nh_daily_health table from a heap to the correct structure of btree with unique keys. The nh_daily_health table was corrected to btree, and now the nh_daily_exceptions table is showing heap instead of btree. Spoke to Yulun, who said to log a bug and collect the output with the nhCustCollectData script. Awaiting output. 5/30/2001 6:24:08 AM foconnor ====================================================== Files can be found (output of nhCustCollectData): //BAFS/escalated tickets/49000/49018 ====================================================== Data analysis failing on the nh_daily_exceptions table.
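The heap-versus-btree mismatch described above is the kind of thing the iitables storage_structure query in this ticket is meant to surface. Given a dump in that two-column shape, a one-line filter flags the offenders; the sample rows below are fabricated for illustration, not the customer's actual output (real Ingres sql output also carries headers and borders that would need trimming first).

```shell
# Flag any table whose storage structure is "heap"; btree tables pass.
# Sample iitable.out contents are made up for illustration.
cat > /tmp/iitable.out <<'EOF'
nh_daily_exceptions heap
nh_daily_health btree
nh_daily_symbol btree
EOF
awk '$2 == "heap" { print "needs rebuild: " $1 }' /tmp/iitable.out
```

On the sample input this prints only the nh_daily_exceptions line, matching the table the data analysis job is complaining about.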
----- Job started by Scheduler at '30/05/2001 01:00:35'. ----- ----- $NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME -Dlog -DlogPath /u/nh/log/advanced/ ----- Begin processing (30/05/2001 01:00:36). Error: Unable to execute 'MODIFY nh_daily_exceptions TO MERGE' (E_US1595 MODIFY: nh_daily_exceptions: table is not a btree; only a btree table can be modified to merge. (Tue May 29 12:00:23 2001) ). ----- Scheduled Job ended at '30/05/2001 02:00:27'. Rollups are fine 5/30/2001 6:26:20 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Wednesday, May 30, 2001 6:17 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 14865; Call ticket 49018 Data Analysis failing Yulun, The output of the nhCustCollectData can be found at: //BAFS/escalated tickets/49000/49018/May30 5/30/2001 10:31:37 AM yzhang Farrell, Have the customer run prob_14864.sh by just typing the script name, then run cleanStats_mod.sh by typing: cleanStats_mod.sh clean Both scripts have been tested; after these they can run data analysis. If they still encounter a problem, get the following: echo "select table_name, storage_structure from iitables order by table_name\g" | sql $NH_RDBMS_NAME > iitable.out Thanks Yulun 6/5/2001 5:30:25 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Tuesday, June 05, 2001 5:20 AM To: Zhang, Yulun; Lemmon, Jim Cc: Trei, Robin; Chapman, Sheldon Subject: Problem ticket 14865 Importance: High Yulun, I had him run the 14864.sh by just typing the script name, then run cleanStats_mod.sh by typing: cleanStats_mod.sh clean and his dataAnalysis is still failing.
The output of the below command can be found at //BAFS/tickets/49000/49018 echo "select table_name, storage_structure from iitables order by table_name\g" | sql $NH_RDBMS_NAME > iitable.out Regards, Farrell 6/5/2001 5:30:39 AM foconnor - 6/5/2001 11:08:18 AM yzhang After they run cleanStats_mod.sh, there should be a file called concordClean.out under the directory where they ran the script; can you get this file? Also, they should use clean as the argument for running cleanStats_mod.sh. Check to see if they applied the argument. Thanks Yulun 6/6/2001 8:59:57 AM yzhang Did the customer run the cleanStats_mod.sh I sent yesterday? If they have run it, I need concordClean.out, and a list of tables as: echo "select table_name, storage_structure from iitables order by table_name\g" | sql $NH_RDBMS_NAME > table.out 6/6/2001 9:32:23 AM yzhang I tested the script before I sent it yesterday. The concordClean.out is not in the tmp directory; it is under the directory where they run the script, so have the customer search for the file. You can do a test with my script on your system, so that you know what the script is doing. Yulun 6/7/2001 8:03:11 AM foconnor Received data from customer and have forwarded to Yulun. 6/7/2001 9:21:39 AM yzhang Farrell, Have the customer do a stats rollup, then data analysis. If data analysis fails due to either a duplicate or a btree problem, send me the log file and an iitable.out like the one you attached. Thanks Yulun 6/7/2001 1:30:30 PM yzhang Have them run: ./nhiIndexDb -d $NH_RDBMS_NAME -u $NH_USER, then run data analysis. It should work. Thanks Yulun 6/7/2001 3:49:12 PM foconnor nhreport$ nhiIndexDb -d $NH_RDBMS_NAME -u $NH_USER Creating the Table Structures and Indices . . . Non-Fatal database error on object: NH_DAILY_EXCEPTIONS 08-Jun-2001 05:28:06 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys.
(Thu Jun 7 15:28:06 2001) Non-Fatal database error on object: NH_HOURLY_VOLUME 08-Jun-2001 05:30:08 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Thu Jun 7 15:30:08 2001) Non-Fatal database error on object: NH_DAILY_SYMBOL 08-Jun-2001 05:32:34 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Thu Jun 7 15:32:33 2001) Creating the Table Structures and Indices for sample tables . . . Index of database 'nethealth' for user 'emc' was unsuccessful. 6/7/2001 6:16:59 PM yzhang Have them run the attached script just by typing the script name. After running the script, three data files and a concord.out file will be produced under the directory where they run the script. Have them keep the three data files in a safe place and send you the concord.out. Thanks Yulun 6/8/2001 5:40:45 AM foconnor Script was run and the concord.out file was sent to Yulun. Also found at: //BAFS/tickets/49000/49018/June07/phase3 6/8/2001 8:30:08 AM yzhang The script was executed successfully; they can run the data analysis now. Thanks Yulun 6/11/2001 9:11:26 AM yzhang update to more info 6/12/2001 9:04:28 AM yzhang The customer says he ran dataAnalysis and now all is well. Ticket closed 6/6/2001 2:42:16 PM jnormandin Problem: Customer is running NH 4.8 (unpatched) on NT 4.0 SP5. When he runs a dbStatus (either via GUI or command line), the command hangs and then eventually returns incorrect data. E.g., db size 0.0, latest entry 6/5/2001, earliest entry 6/5/2001, last rollup 6/5/2001. I had the customer run the db status via the command line utilizing the -Dall debug option. The debug revealed that the db status executable hung at the second transaction level 1 begin. (Confirmed by comparing my own debug run against the customer's debug log file.) Given that information, I decided to run a SQL command trace on the executable, utilizing the ING_SET "set printqry" Ingres debug tool.
The output from this debug showed that the customer's dbStatus command completed the same number of sql statements and finished in the same manner as my own Ingres debug run. (Confirmed by comparing in-house results with those returned from the customer.) I then examined the Ingres error log to identify any possible errors or issues which coincided with the run of the dbStatus command. There were in fact multiple issues occurring at that time. The entry from the log file is as follows: AS_SISCO::[II\INGRES\1dd , ffffffff]: Wed Jun 06 11:00:38 2001 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (int.wnt/00) Server -- Normal Startup. AS_SISCO::[II\INGRES\1dd , 000001ad]: Wed Jun 06 11:06:03 2001 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association 000001df Wed Jun 06 11:09:04 2001 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 1 for table iirelation in database nethealth with mode 5. Resource held by session [477 236]. 000001df Wed Jun 06 11:09:04 2001 E_DM0042_DEADLOCK Resource deadlock. 000001df Wed Jun 06 11:09:04 2001 E_QE002A_DEADLOCK Deadlock detected. This seems odd, as I would have expected the deadlock errors to hinder the results of the SQL queries run by the executable, but judging from the SQL trace this is not the case. The customer has run maintenance on a regular basis (the last time was Sunday, June 3rd) and there were no errors encountered at that time. I also had the customer manually run maintenance via the command line, but this did not have any effect on the dbStatus situation. There is a customer sensitivity factor involved, as the customer believes that the above issues may also contribute to a data integrity problem. I did my best to alleviate those concerns by stating that all other aspects of the DB I/O seem to occur without error (polling, reports, etc.).
All relevant files are located on \\Bafs\tickets\50000\50444 6/6/2001 4:44:22 PM yzhang Have the customer try these: 1) stop and restart nhserver and Ingres 2) run nhiDbMaint.exe 3) as ingres, run sysmod dbname 4) then nhDbStatus; if the same thing happens, do the following: a) run nhDbStatus with advanced debug on, and send the debug output file b) ingsetenv II_EMBED_SET printqry from the Bourne shell, run nhDbStatus, and collect iipryqry.log from the directory where they run nhDbStatus Thanks Yulun 6/6/2001 4:59:35 PM jnormandin - Information has already been given (see above) 6/7/2001 10:50:22 AM yzhang How did the sysmod dbname go? Can you have the customer do this, and send me the output. 6/7/2001 10:57:28 AM yzhang An issue has been created with CA regarding the deadlock on iirelation 6/7/2001 2:58:52 PM yzhang Dror, I am the engineer working with Jason on your dbStatus problem; I am writing to you directly because Jason is in training today. Can you do the following: 1) login as nhuser and source nethealthrc.csh 2) then login as ingres, and start Ingres if it is not running 3) sysmod $NH_RDBMS_NAME > sysmod.out, and send me the sysmod.out Also give me your phone number, if you don't mind Thanks Yulun 6/11/2001 4:34:31 PM yzhang John, The sysmod looks good; have the customer try nhDbStatus nethealth again. Thanks Yulun 6/11/2001 4:56:23 PM yzhang sysmod reconstructs the system catalogs. The reboot will not hurt, but it is not required. They may need to stop and then restart Ingres, then run nhDbStatus. I will determine what to do next based on the output of dbStatus.
Thanks Yulun 6/12/2001 1:05:35 PM yzhang Jason, can you check with the customer regarding their status? This ticket should be de-escalated if we still have not heard anything from them this afternoon Thanks Yulun 6/12/2001 3:31:58 PM yzhang Jason, based on the results the customer sent, they are up and running; the execution of nhDbStatus is perfect. The database size of 0 in the output just means that they have no conversation data, and they don't have to worry about this. Sheldon, this ticket will be closed, or at least de-escalated. Thanks Yulun 6/12/2001 4:42:02 PM yzhang The nhDbStatus is working now, the ticket closed 6/7/2001 3:26:09 PM jpoblete Customer: MCIWCOM The customer recovered an inconsistent DB with a DB save which was no good; they ended up losing data from May 8th to June 6th. The production system is running right now; they are polling 30,000 elements and this is a critical server. They found in one of their directories a database save which contained a backup of the nh_stats0 and nh_stats1 tables, which they are willing to recover.
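The nh_stats save files listed just below carry what appears to be a Unix-epoch suffix: consecutive nh_stats0 names differ by 3600 seconds and consecutive nh_stats1 names by 86400, i.e. one file per hour or per day, stamped with the last second of the period it covers. Assuming GNU date for the -d @epoch form, a suffix can be decoded like this:

```shell
# Decode the epoch suffix of an nh_stats save file (GNU date assumed).
# 990334799 is the first nh_stats0 suffix in the listing below; note it
# is one second short of an exact hour boundary (990334800 = 275093*3600),
# consistent with "last second of the hour covered".
date -u -d @990334799 +'%Y-%m-%d %H:%M:%S'
```

This decodes to 2001-05-20 04:59:59 UTC, squarely inside the May 8 to June 6 window the customer is trying to recover, which is a quick way to confirm the save actually spans the gap.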
Discussed this with Yulun Zhang; it seems this task is doable, but we do need to review the data they want to insert. The files from which they want to recover the statistics data are: -rw-r--r-- 1 ingres staff 2758742 May 23 17:56 nh_stats0_990334799.Z -rw-r--r-- 1 ingres staff 2766472 May 23 17:48 nh_stats0_990338399.Z -rw-r--r-- 1 ingres staff 2654769 May 23 18:36 nh_stats0_990341999.Z -rw-r--r-- 1 ingres staff 2717975 May 23 18:09 nh_stats0_990345599.Z -rw-r--r-- 1 ingres staff 2697905 May 23 16:42 nh_stats0_990349199.Z -rw-r--r-- 1 ingres staff 2696357 May 23 16:57 nh_stats0_990352799.Z -rw-r--r-- 1 ingres staff 2699115 May 23 16:57 nh_stats0_990356399.Z -rw-r--r-- 1 ingres staff 2680565 May 23 16:48 nh_stats0_990359999.Z -rw-r--r-- 1 ingres staff 2669491 May 23 16:23 nh_stats0_990363599.Z -rw-r--r-- 1 ingres staff 2727304 May 23 18:05 nh_stats0_990367199.Z -rw-r--r-- 1 ingres staff 2673533 May 23 18:06 nh_stats0_990370799.Z -rw-r--r-- 1 ingres staff 2707123 May 23 17:32 nh_stats0_990374399.Z -rw-r--r-- 1 ingres staff 2709851 May 23 17:11 nh_stats0_990377999.Z -rw-r--r-- 1 ingres staff 2700123 May 23 18:36 nh_stats0_990381599.Z -rw-r--r-- 1 ingres staff 2672127 May 23 16:24 nh_stats0_990385199.Z -rw-r--r-- 1 ingres staff 2687227 May 23 17:17 nh_stats0_990388799.Z -rw-r--r-- 1 ingres staff 2683845 May 23 17:09 nh_stats0_990392399.Z -rw-r--r-- 1 ingres staff 2683359 May 23 16:18 nh_stats0_990395999.Z -rw-r--r-- 1 ingres staff 2687757 May 23 16:44 nh_stats0_990399599.Z -rw-r--r-- 1 ingres staff 1939855 May 23 18:09 nh_stats0_990403199.Z -rw-r--r-- 1 ingres staff 2683927 May 23 17:31 nh_stats0_990406799.Z -rw-r--r-- 1 ingres staff 2723266 May 23 17:20 nh_stats0_990410399.Z -rw-r--r-- 1 ingres staff 2625642 May 23 17:02 nh_stats0_990413999.Z -rw-r--r-- 1 ingres staff 2697985 May 23 16:24 nh_stats0_990417599.Z -rw-r--r-- 1 ingres staff 2739944 May 23 18:07 nh_stats0_990421199.Z -rw-r--r-- 1 ingres staff 2641592 May 23 17:41 nh_stats0_990424799.Z -rw-r--r-- 1 ingres staff 2865704 
May 23 16:58 nh_stats0_990428399.Z -rw-r--r-- 1 ingres staff 2788107 May 23 17:48 nh_stats0_990431999.Z -rw-r--r-- 1 ingres staff 2761405 May 23 16:23 nh_stats0_990435599.Z -rw-r--r-- 1 ingres staff 2774813 May 23 18:09 nh_stats0_990439199.Z -rw-r--r-- 1 ingres staff 2739016 May 23 16:46 nh_stats0_990442799.Z -rw-r--r-- 1 ingres staff 2781800 May 23 16:52 nh_stats0_990446399.Z -rw-r--r-- 1 ingres staff 2868848 May 23 16:28 nh_stats0_990449999.Z -rw-r--r-- 1 ingres staff 2322885 May 23 16:58 nh_stats0_990453599.Z -rw-r--r-- 1 ingres staff 3042243 May 23 16:45 nh_stats0_990457199.Z -rw-r--r-- 1 ingres staff 2917633 May 23 17:34 nh_stats0_990460799.Z -rw-r--r-- 1 ingres staff 2883098 May 23 17:57 nh_stats0_990464399.Z -rw-r--r-- 1 ingres staff 2842851 May 23 18:36 nh_stats0_990467999.Z -rw-r--r-- 1 ingres staff 2862029 May 23 18:02 nh_stats0_990471599.Z -rw-r--r-- 1 ingres staff 2881789 May 23 16:57 nh_stats0_990475199.Z -rw-r--r-- 1 ingres staff 2852874 May 23 18:06 nh_stats0_990478799.Z -rw-r--r-- 1 ingres staff 2765743 May 23 16:27 nh_stats0_990482399.Z -rw-r--r-- 1 ingres staff 2772582 May 23 17:31 nh_stats0_990485999.Z -rw-r--r-- 1 ingres staff 2780588 May 23 16:45 nh_stats0_990489599.Z -rw-r--r-- 1 ingres staff 2751024 May 23 17:56 nh_stats0_990493199.Z -rw-r--r-- 1 ingres staff 2758958 May 23 16:27 nh_stats0_990496799.Z -rw-r--r-- 1 ingres staff 2772283 May 23 16:21 nh_stats0_990500399.Z -rw-r--r-- 1 ingres staff 2771079 May 23 17:10 nh_stats0_990503999.Z -rw-r--r-- 1 ingres staff 2469065 May 23 16:48 nh_stats0_990507599.Z -rw-r--r-- 1 ingres staff 2826389 May 23 17:18 nh_stats0_990511199.Z -rw-r--r-- 1 ingres staff 2737073 May 23 16:25 nh_stats0_990514799.Z -rw-r--r-- 1 ingres staff 2837957 May 23 17:51 nh_stats0_990518399.Z -rw-r--r-- 1 ingres staff 2778566 May 23 16:44 nh_stats0_990521999.Z -rw-r--r-- 1 ingres staff 2823397 May 23 18:36 nh_stats0_990525599.Z -rw-r--r-- 1 ingres staff 2803827 May 23 16:19 nh_stats0_990529199.Z -rw-r--r-- 1 ingres staff 
2812481 May 23 17:57 nh_stats0_990532799.Z -rw-r--r-- 1 ingres staff 2851189 May 23 17:41 nh_stats0_990536399.Z -rw-r--r-- 1 ingres staff 2948539 May 23 16:49 nh_stats0_990539999.Z -rw-r--r-- 1 ingres staff 2945199 May 23 18:11 nh_stats0_990543599.Z -rw-r--r-- 1 ingres staff 2933207 May 23 16:58 nh_stats0_990547199.Z -rw-r--r-- 1 ingres staff 2919967 May 23 16:57 nh_stats0_990550799.Z -rw-r--r-- 1 ingres staff 2898691 May 23 17:19 nh_stats0_990554399.Z -rw-r--r-- 1 ingres staff 2334847 May 23 17:48 nh_stats0_990557999.Z -rw-r--r-- 1 ingres staff 2372402 May 23 18:03 nh_stats0_990561599.Z -rw-r--r-- 1 ingres staff 16924449 May 23 17:21 nh_stats1_986788799.Z -rw-r--r-- 1 ingres staff 17607791 May 23 17:11 nh_stats1_986875199.Z -rw-r--r-- 1 ingres staff 17879085 May 23 18:31 nh_stats1_986961599.Z -rw-r--r-- 1 ingres staff 18126237 May 23 18:28 nh_stats1_987047999.Z -rw-r--r-- 1 ingres staff 18154285 May 23 18:34 nh_stats1_987134399.Z -rw-r--r-- 1 ingres staff 17704422 May 23 18:25 nh_stats1_987220799.Z -rw-r--r-- 1 ingres staff 17603887 May 23 17:43 nh_stats1_987307199.Z -rw-r--r-- 1 ingres staff 17436299 May 23 17:41 nh_stats1_987393599.Z -rw-r--r-- 1 ingres staff 17966755 May 23 17:25 nh_stats1_987479999.Z -rw-r--r-- 1 ingres staff 17835834 May 23 18:02 nh_stats1_987566399.Z -rw-r--r-- 1 ingres staff 17904617 May 23 16:47 nh_stats1_987652799.Z -rw-r--r-- 1 ingres staff 17477005 May 23 17:33 nh_stats1_987739199.Z -rw-r--r-- 1 ingres staff 16673657 May 23 18:27 nh_stats1_987825599.Z -rw-r--r-- 1 ingres staff 16026089 May 23 16:20 nh_stats1_987911999.Z -rw-r--r-- 1 ingres staff 15839732 May 23 16:52 nh_stats1_987998399.Z -rw-r--r-- 1 ingres staff 16626417 May 23 17:15 nh_stats1_988084799.Z -rw-r--r-- 1 ingres staff 16723867 May 23 18:35 nh_stats1_988171199.Z -rw-r--r-- 1 ingres staff 16553433 May 23 17:56 nh_stats1_988257599.Z -rw-r--r-- 1 ingres staff 2305 May 23 16:23 nh_stats1_988343999.Z -rw-r--r-- 1 ingres staff 16663255 May 23 18:00 nh_stats1_988430399.Z 
-rw-r--r-- 1 ingres staff 15943092 May 23 18:06 nh_stats1_988516799.Z -rw-r--r-- 1 ingres staff 15967565 May 23 17:44 nh_stats1_988603199.Z -rw-r--r-- 1 ingres staff 17399441 May 23 16:27 nh_stats1_988689599.Z -rw-r--r-- 1 ingres staff 18287713 May 23 17:17 nh_stats1_988775999.Z -rw-r--r-- 1 ingres staff 18261957 May 23 17:27 nh_stats1_988862399.Z -rw-r--r-- 1 ingres staff 18943690 May 23 17:36 nh_stats1_988948799.Z -rw-r--r-- 1 ingres staff 19215695 May 23 17:39 nh_stats1_989035199.Z -rw-r--r-- 1 ingres staff 18850859 May 23 18:11 nh_stats1_989121599.Z -rw-r--r-- 1 ingres staff 18788223 May 23 16:44 nh_stats1_989207999.Z -rw-r--r-- 1 ingres staff 19461681 May 23 16:50 nh_stats1_989294399.Z -rw-r--r-- 1 ingres staff 19438897 May 23 17:50 nh_stats1_989380799.Z -rw-r--r-- 1 ingres staff 19417095 May 23 17:29 nh_stats1_989467199.Z -rw-r--r-- 1 ingres staff 18230275 May 23 18:04 nh_stats1_989553599.Z -rw-r--r-- 1 ingres staff 18725891 May 23 17:31 nh_stats1_989639999.Z -rw-r--r-- 1 ingres staff 18577689 May 23 17:13 nh_stats1_989726399.Z -rw-r--r-- 1 ingres staff 18506581 May 23 17:01 nh_stats1_989812799.Z -rw-r--r-- 1 ingres staff 19160844 May 23 17:47 nh_stats1_989899199.Z -rw-r--r-- 1 ingres staff 19225955 May 23 17:37 nh_stats1_989985599.Z -rw-r--r-- 1 ingres staff 19129253 May 23 17:52 nh_stats1_990071999.Z -rw-r--r-- 1 ingres staff 19250173 May 23 17:54 nh_stats1_990158399.Z -rw-r--r-- 1 ingres staff 19354165 May 23 18:08 nh_stats1_990244799.Z 6/11/2001 3:13:43 PM yzhang Can you check with the customer why and how they saved the stats0 tables as .gz files instead of .zip files? I can gunzip the .gz files, but cannot copy them into the tables due to an unexpected end-of-file error. Do they have the stats files (the files they want to load back) in a *.zip save? Thanks Yulun 6/12/2001 9:14:56 AM yzhang Jose, Can you have the customer send the *.zip files (the files they want to load back into the database) as soon as possible.
Thanks Yulun 6/12/2001 3:00:41 PM jpoblete The file is in the FTP directory: ftp://ftp.concord.com/incoming/mciwcom/data1.tar 6/12/2001 6:12:27 PM yzhang Cannot copy the files into the tables due to an unexpected end-of-file error; can you have the customer try to load the nightly_save.tdb into the database. Thanks Yulun 6/13/2001 1:05:02 PM yzhang Jose, this is good. After the loading finishes, collect: echo "select table_name from iitables where table_name like '%stats%' order by table_name\g" | sql nethealth_test > table.out. With this we will know which stats tables got loaded. Thanks Yulun 6/15/2001 11:31:59 AM schapman Yulun, the customer tried to load into the nethealth_test database and they got the same error. Begin processing (06/15/2001 09:12:38). Copying relevant files (06/15/2001 09:12:39). Error: '/opt/neth/db/save/nightly_save.tdb/nvr_b23' is not a file name. Load of database 'nethealth_test' for user 'neth' was unsuccessful. Error: '/opt/neth/db/save/nightly_save.tdb/nvr_b23' is not a file name. Error: The program nhiLoadDb failed. The load.log is on BAFS 6/15/2001 1:40:00 PM yzhang The customer called me about his problem. The database save he was loading from is missing some files, including one that appears in the load.log. Now he is loading another db save, which has most of the required files. Yulun 6/18/2001 9:13:32 AM yzhang Jim, I was trying to recover your data (data2.tar from ftp.concord.com) by using the attached script, but unfortunately, for almost all of the tables there is an unexpected end-of-file error when attempting to copy the data into the tables. There may be a better chance for you to do the recovery on your system.
Can you try the following: 1) cd $NH_HOME/db/save 2) mkdir test 3) copy the data2.tar to $NH_HOME/db/save/test 4) cd $NH_HOME/db/save/test 5) tar xvf data2.tar 6) run the attached script as: prob_15211_noindex.sh > prob_15211_noindex.out, and send us the prob_15211_noindex.out 6/18/2001 12:18:53 PM wburke -----Original Message----- From: Burke, Walter Sent: Monday, June 18, 2001 12:08 PM To: 'jim.maynard@wcom.com'; 'performance-management@wcom.com' Cc: Gray, Don; Keville, Bob; Ciavarro, Mike Subject: FW: Ticket # 50474 - Ingres Database Recovery Jim, As per our conversation on the phone this morning, I understand that, at this time, your resources are dedicated to a network emergency. As we discussed on Friday afternoon, in order to begin work on the database, we will require the raw data directories for both the nethealth and iidbdb portions of the database: the $II_DATABASE/data/default/nethealth and iidbdb directories. Please forward them as soon as possible, along with instructions on the order in which the tar files were created. Let me know if you need any assistance or clarification. Sincerely, 6/18/2001 3:36:25 PM yzhang Can you check to see if the customer ran the script OK, and how many tables they can copy? And is there any progress on collecting the disk backup and the other related information? Thanks Yulun 6/19/2001 3:26:29 PM yzhang prob_15211_noindex.out 6/19/2001 3:27:12 PM yzhang Let the customer know that I was trying to recover their data (data2.tar from ftp.concord.com) by using the attached script, but unfortunately, for almost all of the tables there is an unexpected end-of-file error when attempting to copy the data into the tables. There may be a better chance for them to do the recovery on their system.
Can you try the following: 1) login as nhuser, source nethealthrc.csh, and cd $NH_HOME/db/save 2) mkdir test 3) copy the data2.tar and the attached script to $NH_HOME/db/save/test 4) cd $NH_HOME/db/save/test 5) tar xvf data2.tar 6) run the attached script as: prob_15211_noindex.sh > prob_15211_noindex.out and send us the prob_15211_noindex.out 6/19/2001 3:45:44 PM yzhang It might work on their system, but to confirm, run the following: echo " select table_name, num_rows from iitables where table_name like '%stats0%'\g" | sql nethealth > stats0_table.out. If it works, they need to run nhiIndexStats Thanks Yulun 6/19/2001 5:57:02 PM yzhang No data has been copied into any of the stats0 tables; also, the output file does not look like it came from my script. Jim, I would like you to follow the steps here exactly, and use the script attached here, then send me the prob_15211_noindex.out. This will determine how many tables you can recover. 6/19/2001 7:14:55 PM wburke obtained 6/20/2001 8:58:26 AM yzhang Jim, Based on the output, none of the stats0 tables has been recovered, because of the unexpected end-of-file error. Now you can do the following to help us recover your data. 1. Pull a second archive of the tape backup of the entire application. - You may need to FedEx this package to Walter Burke. 2. Extract the raw data files from $II_DATABASE/ingres/data/default/nethealth and $II_DATABASE/ingres/data/default/iidbdb, tar them, and send instructions on the tar sequence. Thanks Yulun 6/20/2001 5:20:03 PM wburke all files on BAFS/50474/data 6/21/2001 9:19:29 AM yzhang Can you provide us with detailed steps (the command syntax) to uncompress a *.bz file? 6/21/2001 11:02:45 AM yzhang Don and Walter, It looks like we need to download software in order to uncompress the customer's data. I am wondering if support can do the downloading and uncompressing, or ask the customer to redo the compression using a popular tool like tar, then gzip. Thanks Yulun 6/21/2001 7:18:11 PM wburke Uncompress completed.
- keg: /disk3/nethealth/nethealth /disk3/iidbdb/iidbdb 6/22/2001 12:36:30 PM yzhang Can you get these four files: $II_SYSTEM/ingres/files/install.log; login as ingres, ingprenv > ingprenv.out; login as nh_user, source nethealthrc.csh, env > env.out. Walter, can you help him see if he can get the nethealth install log? Basically I want to do an install in exactly the same way as he did. Thanks Yulun 6/28/2001 1:53:14 PM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, June 28, 2001 1:42 PM To: 'jim.maynard@wcom.com' Cc: 'greg.hodge@wcom.com'; Gray, Don; Zhang, Yulun; Mitton, Sean; 'steve.smith@wcom.com' Subject: Ticket # 50474 Jim, At this point, we have tried the following steps to recover your database gap for 5/8/01 - 6/6/01. 1. Retrieved the saved DB structure and attempted to load it. - The files here were totally corrupted. - This is most likely due to the server crashing during the middle of the save process. 2. Retrieved raw data tables from the tape backups of the system. - These files failed to contain data files for the gap. - These tapes were created after the crash of 6/6/01. Unfortunately, neither of these steps was successful. Moving forward, we should implement sound recovery strategies based on your business environment. Please contact me if you have any questions. Sincerely, 7/17/2001 6:17:31 PM yzhang Jim, Attached is the script to run on the FROM machine for copying each stats table into a file. Run this script after you finish step 6 as instructed by Robin (that is, after the element_mapping table is created and filled). What you need to do is copy this script into $NH_HOME/tmp, chmod it to 777, then type the script name; the output will be written to cleanstats0.out in $NH_HOME/tmp. Keep this output file, and send it to us in case the script does not succeed. The script has been tested and runs OK on my system. I will write you another script tomorrow to load the data into the database.
Yulun

7/19/2001 6:22:27 PM yzhang Jim, here are the steps to do the data transfer and load:
1) Make sure you have enough space on the TO machine.
2) ftp the stats *.dat files from the FROM machine to $NH_HOME/tmp on the TO machine (you can tar the .dat files and then ftp the tar, but you will probably need several ftp transfers due to the huge amount of data).
3) Make sure that under $NH_HOME/tmp, the only files with a *.dat extension are the stats data files transferred from the FROM machine.
4) Place the attached script in $NH_HOME/tmp on the TO machine and change its mode to 777.
5) Type the script name to run the script; the script only creates the tables and loads the data.
6) Run nhiIndexStats to place indexes on the 2009 tables.
7) Append the nh_rlp_boundary table from the FROM machine to the TO machine: copy nh_rlp_boundary into a file on the FROM machine, then load this file into nh_rlp_boundary on the TO machine.
If all steps are OK, then you are all set. Before you execute the above steps for all 2009 files, you can ftp just a few *.dat files to see whether the whole sequence works, especially whether the script works on your system. I tested the script on my system. Let us know if you have any questions. Thanks, Yulun

7/27/2001 8:17:21 AM yzhang Andrew, can you get the following for me, and email me the four files?
From the TO machine:
1) echo "help\g" | sql $NH_RDBMS_NAME > help.out
2) echo "copy table nh_rlp_boundary() into 'nh_rlp_boundary_to.dat'\g" | sql $NH_RDBMS_NAME
From the FROM machine:
1) echo "help\g" | sql $NH_RDBMS_NAME > help.out
2) echo "copy table nh_rlp_boundary() into 'nh_rlp_boundary_from.dat'\g" | sql $NH_RDBMS_NAME
Basically, I want to know whether the stats tables on the TO machine have been properly indexed, and whether the nh_rlp_boundary table was transferred the way we expected. Thanks, Yulun

8/2/2001 9:41:17 AM yzhang Problem solved.

6/7/2001 5:05:21 PM jpoblete Uncovered this while testing the command to troubleshoot a customer problem.
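Step 3 of the 7/19 transfer procedure above (nothing but the transferred *.dat files in $NH_HOME/tmp) can be sanity-checked before running the load script. A hedged sketch that uses a temporary directory with invented file names in place of a real $NH_HOME:

```shell
# Stand-in for $NH_HOME/tmp on the TO machine; file names are invented.
NH_TMP=$(mktemp -d)
touch "$NH_TMP/nh_stats0_991166399.dat" "$NH_TMP/nh_stats0_991252799.dat"
# Count the transferred .dat files, and anything else that could confuse
# the loader. Proceed with the load script only when stray is 0.
dats=$(find "$NH_TMP" -type f -name '*.dat' | wc -l)
stray=$(find "$NH_TMP" -type f ! -name '*.dat' | wc -l)
echo "dat files: $dats, stray files: $stray"
```

On the real TO machine the same two find commands, pointed at $NH_HOME/tmp, give a quick go/no-go check before step 5.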
Create a list of elements to be deleted from the following sql command: echo "select name from nh_element\g" | sql nethealth > elements.txt. Formatted the file to keep only some names, then used this file to feed nhDeleteElements: nhDeleteElements -inFile elements.txt. After the processing, the pollerAudit file shows the following for all the elements to be deleted:
2001/06/07-16:53:58-EDT jpoblete i success 0 delete z_d_zd_pooh510-link-5 "element not found"
2001/06/07-16:53:58-EDT jpoblete i success 0 delete z_d_zd_pooh510-link-6 "element not found"
2001/06/07-16:53:58-EDT jpoblete i success 0 delete z_d_zd_pooh510-link-7 "element not found"
Of course, the elements do exist in the poller configuration and in the DB:
segment "z_d_zd_pooh510-link-5" { agentAddress "172.16.15.132:5100" uniqueDeviceId "192.168.12.21" mibTranslationFile "bay-wellfleet-rrev7-mib2.mtf" index "5" statistics "1" discoverMtf "bay-wellfleet-rrev7-mib2.mtf" sysDescr "Image: fix/8.01/4 Created on Mon Mar 13 17:02:47 EST 1995." sysName "pooh:510" sysLoc "Berlin Miraustr." sysContact "Ralf Holzhueter Tel.: 030 43908593" ifType "0" }
segment "z_d_zd_pooh510-link-6" { agentAddress "172.16.15.132:5100" uniqueDeviceId "192.168.12.21" mibTranslationFile "bay-wellfleet-rrev7-mib2.mtf" index "6" statistics "1" discoverMtf "bay-wellfleet-rrev7-mib2.mtf" sysDescr "Image: fix/8.01/4 Created on Mon Mar 13 17:02:47 EST 1995." sysName "pooh:510" sysLoc "Berlin Miraustr." sysContact "Ralf Holzhueter Tel.: 030 43908593" ifType "0" }
segment "z_d_zd_pooh510-link-7" { agentAddress "172.16.15.132:5100" uniqueDeviceId "192.168.12.21" mibTranslationFile "bay-wellfleet-rrev7-mib2.mtf" index "7" statistics "1" discoverMtf "bay-wellfleet-rrev7-mib2.mtf" sysDescr "Image: fix/8.01/4 Created on Mon Mar 13 17:02:47 EST 1995." sysName "pooh:510" sysLoc "Berlin Miraustr." sysContact "Ralf Holzhueter Tel.: 030 43908593" ifType "0" }

6/21/2001 12:50:09 PM rtrei Yulun, this is a lower priority than your current 5.0 beta bugs.

8/7/2001 3:37:09 PM yzhang nhDeleteElement works fine on nh50; the ticket was declined.

6/8/2001 12:15:08 PM wburke
S_DU04C0_CKING_CATALOGS VERIFYDB: beginning check of DBMS catalogs for database nethealth
S_DU1601_INVALID_ATTID Table iirelation (owner $ingres) has a mismatch in number of columns. iirelation indicates there are 45 but iiattribute contains 8224.
S_DU0302_DROP_TABLE The recommended action is to drop table iirelation (owner $ingres) from the database.
S_DU0300_PROMPT Enter Y for yes, N for no. >

6/8/2001 12:15:39 PM wburke Yulun is working with CA on this.

6/11/2001 3:57:51 PM yzhang Jose, I think this is the latest we received from CA; did you talk to the customer about this one? Thanks.
The prognosis on this problem isn't good. The iirelation is corrupt. If this were any table except iirelation, then we'd have a much better chance at recovering the data. But iirelation is special; it contains the information on all the catalogs in the database. If this is corrupt, then you can't access the database (which they can't, in this case). In some cases we could unload the database, but that isn't an option here because Ingres needs to access iirelation to get table information to create the unload scripts (it is also out of the question simply because we cannot run the sql terminal monitor either). There are a couple of options, neither of which is very appealing. The first is to recover the database from a backup. This option isn't too great because it's my understanding that the database is NOT checkpointed; rather, some sort of disk backup is used instead, and this was done last weekend, making for a very old backup. The other option is to contact our Professional Services Group, the group to contact for extreme problems such as this.
I've already given Bob Kelville the contacts for that. The downside is that these are billable services. Please keep me informed on what the customer wishes to do. Kind Regards,

6/11/2001 5:21:59 PM jpoblete Customer has decided to contact CA Professional Services. We will keep this on hold until we hear back from them.

7/22/2001 11:51:46 AM yzhang I noticed that the call ticket has been closed for this one; I believe I can close the problem ticket, right? Thanks, Yulun

7/23/2001 2:59:03 PM yzhang Problem solved.

8/1/2001 4:01:46 PM dbrooks Closed; see above note.

6/13/2001 4:08:08 PM wburke Database save fails at: Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats0_991166399 () INTO '/opt/health/save.tdb/nh. Customer is also working with a forced DB, as the DB went inconsistent because one of their system admins added drives to the array that Nethealth resides on while NH was running. Customer needs a recent full save so that they can move the DB to a new machine running NH 4.8 on Monday, June 18th.

6/21/2001 12:58:59 PM rtrei Yulun, this has been sitting in Jim's queue for a while. Given that it isn't escalated, and you have beta blockers due by next Friday, I would place it as a lower priority than the beta bugs. However, the customer is having problems, so it is important and should be looked into as soon as you have the time.

6/21/2001 2:21:10 PM yzhang The problem table during the db save is nh_stats0_991166399. I noticed that this table is not in the database; to be sure, you need to check with the customer whether the table exists in the current database. You may want to check whether a corresponding physical file exists by running:
select a.table_name, a.file_name from iifile_info a, iitables b where a.table_name = b.table_name and b.table_name = 'nh_stats0_991166399'
If the physical file is there, mv it and save it somewhere else, then do a save. If the table exists in the current database, test whether they can copy this table to a file.
If the table does not exist in the current db, create an empty table as: create table nh_stats0_991166399 as select * from nh_rlp_status, then run the db save. Thanks, Yulun

6/21/2001 2:21:32 PM yzhang Customer is up and running.

6/14/2001 11:23:22 AM foconnor Customer had a problem with the checkpoint save crashing because they ran out of disk space (a worker moved 45 GB worth of files to the database partition). The space problem has been remedied, but the database went inconsistent and they have not been able to save the database because the saves fail with errors. So they cannot complete the save, destroy, create, and load sequence. This is a large database and it takes about 5-6 hours to perform a database save. Customer is going to send me the save.log and the output of the nhCollectCustData command.

6/14/2001 11:29:35 AM foconnor
Unloading table nh_var_units . . .
Unloading the sample data . . .
Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats0_991166399 () INTO '/opt/health/save.tdb/nh_stats0_991166399'' (E_SC0206 An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Wed Jun 13 15:39:50 2001) ). (cdb/DuTable::saveTable)
Table nh_stats0_991166399 is dated May 29, 15:59:59 2001

6/14/2001 12:02:32 PM foconnor
-----Original Message-----
From: O'Connor, Farrell
Sent: Thursday, June 14, 2001 11:52 AM
To: Zhang, Yulun
Cc: O'Connor, Farrell
Subject: Call ticket 50658; Problem ticket 15402

Yulun, the output of the nhCollectCustData is at \\BAFS\tickets\50000\50658\nhCollectDb_June14\........
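Yulun's 6/21 workaround above (re-create the missing stats table so the save can proceed) can be scripted. This sketch only generates the SQL file and does not touch a live Ingres instance; the table name is the one from this ticket, and feeding the file to the terminal monitor is left as a comment:

```shell
# Build the SQL for re-creating the missing stats0 table so nhSaveDb can run.
# On the live system this would be fed to Ingres: sql nethealth < recreate_table.sql
TABLE=nh_stats0_991166399
cat > recreate_table.sql <<EOF
create table $TABLE as select * from nh_rlp_status\g
EOF
cat recreate_table.sql
```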
6/14/2001 1:39:56 PM schapman
-----Original Message-----
From: Zhang, Yulun
Sent: Thursday, June 14, 2001 1:27 PM
To: Chapman, Sheldon
Cc: O'Connor, Farrell; Trei, Robin
Subject: RE: Call ticket 50658; Problem ticket 15402

Sheldon, the customer's Ingres keeps crashing with a Segmentation Violation, and the customer is down. This ticket needs to be escalated. Yulun

6/14/2001 5:34:29 PM yzhang Waiting to see if the customer can load the recent backup.

6/15/2001 12:55:40 PM foconnor Customer has filled up the partition that Network Health and Ingres reside on. Last night the customer had problems with destroying the database, and we had to destroy it manually. Customer attempted a load last night, but the disk partition filled up and the load.log file was created but empty. Ingres crashed again, so I recommended that he either start deleting files, add disk space, or both. Customer is going to add disk space, start up Ingres, and run nhDestroyDb, nhCreateDb, and nhLoadDb again.

6/18/2001 5:15:16 PM wburke
-----Original Message-----
From: Burke, Walter
Sent: Monday, June 18, 2001 5:05 PM
To: Gray, Don
Cc: Zhang, Yulun
Subject: FW: Ticket # 50658, PT# 15402

Customer is back up and running after a manual DB destroy and load. Please de-escalate. -Walter

-----Original Message-----
From: Burke, Walter
Sent: Monday, June 18, 2001 5:03 PM
To: 'casabona@merck.com'
Subject: Ticket # 50658

George, this is a notification to confirm that, per your e-mail of 6/18/01, the issue we have been working on, Call Ticket 50658, has been closed. You are being sent this notification because Concord is continually striving to deliver the most robust quality customer service. If you have any further issues or questions, please contact support@concord.com.

6/19/2001 11:44:17 AM wburke
-----Original Message-----
From: Casabona, George M.
[mailto:george_casabona@merck.com]
Sent: Tuesday, June 19, 2001 10:36 AM
To: 'Burke, Walter'
Subject: RE: Ticket # 50658

Walter: Thanks for your call last night and your help fixing the database. Sincerely, George Casabona 732 594 6429

7/23/2001 10:35:54 AM yzhang Problem solved.

6/18/2001 12:49:15 PM rrick Problem: Database crashes every few days. The errlog.log is showing "disk I/O errors..." and "write failed" errors:
00000130 Wed May 23 03:12:39 2001 E_CL0606_DI_BADWRITE Error writing page to disk write() failed with operating system error 33 (The process cannot access the file because another process has locked a portion of the file.)
00000130 Wed May 23 03:12:39 2001 E_DMA44F_LG_WB_BLOCK_INFO An I/O error was encountered writing to the PRIMARY log file. At page 8406, an error was encountered writing 1 pages from buffer address 40F86600. The current log file page size is 4096, and the buffer address is 020
- Customer claims the disks "appear" fine in the NT event logs and CIM hardware logs.
- Waiting for customer to send output of nhInfo.sh

6/18/2001 12:53:24 PM rrick Customer has about 275 stats elements and a few probes on this system. Please reference //bafs/escalated tickets/50000/50625 for error.log and the console error message. Customer bumped up swap space to 2 GB and the problem still exists.

6/18/2001 1:44:46 PM rlindberg This really seems like a DB problem and not a console problem. I'm re-assigning to Robin to evaluate.

6/18/2001 3:32:04 PM yzhang Russell, this is the escalated ticket. Can you check with the customer to see if they can resize the transaction log? Also, the nhDbStatus output is in bmp format; how do you read it, with Event Viewer? They should run nhDbStatus as: nhDbStatus nethealth > nhDbStatus.out. Also tell me when, at what point, Ingres crashes. Yulun

6/18/2001 3:49:08 PM yzhang Russell, also check that they have full permissions on errlog.log. Run nhResizeIngresLog, copy the output into a file, and send the file.
Also, the customer should have their system administrator check whether there is any disk problem. Get this done as soon as possible; if those are not the problem, I will create an issue with CA. Thanks, Yulun

6/18/2001 4:16:49 PM rrick
-----Original Message-----
From: Rick, Russell
Sent: Monday, June 18, 2001 4:06 PM
To: 'st@skandia.com'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475

Hi Scott, can you please supply the following info:
1. At what point does Ingres crash? Can you please be more specific? When the console comes up? When reports are being run? etc.
2. Please run diagnostics to see whether the disks you are running Nethealth and the Ingres partition (if applicable) on are working correctly.
3. What are the permissions on the following file: $NH_HOME\oping\ingres\files\errlog.log
NOTE: This can be checked by cd'ing to $NH_HOME\oping\ingres\files, then typing "ls -al errlog.log" at the command prompt, without the double quotes.
Things to do:
1. Shut down the Nethealth Console by exiting the GUI.
2. Log in as the nethealth user.
3. Shut down the Nethealth Server by executing the following commands at the command line: cd $NH_HOME; nhServer stop
4. Please execute the following command at the command line: $NH_HOME\bin\nhDbStatus nethealth > nhDbStatus.out
5. Please resize the Ingres transaction log by executing the following command from the command line: $NH_HOME\bin\nhResizeIngresLog 1000 > nhResizeIngresLog.out
6. Please forward this information to support@concord.com, Attn: Russ Rick
Regards, Russell K. Rick, Senior Support Engineer

6/19/2001 1:29:25 PM rrick
-----Original Message-----
From: Zhang, Yulun
Sent: Monday, June 18, 2001 5:05 PM
To: Rick, Russell
Subject: RE: 15475/50625

In addition to having them try what we talked about, they need to run the following query: sql nethealth; copy table nh_address () into './nh_address.dat' \g, and watch whether the copy succeeds.

Yulun, the permissions on the errlog.log are 777. The Ingres log was resized to 1000. Output is on escalated tickets/50000/50625.

6/20/2001 11:30:24 AM yzhang Russell, I noticed from the escalated directory that the customer can run nhResizeIngresLog, but from the copytable.out I cannot see whether the copy table succeeded. Have them do a database save from the command line to see if they get the same error message. Thanks, Yulun

6/21/2001 4:01:56 PM rrick Spoke with Scott:
- System just crashed about 1 hour ago.
- He re-booted the box to get the system up and polling again.
- Getting nhCollectCustData

6/26/2001 11:38:54 AM rrick
-----Original Message-----
From: Trueblood, Scott [mailto:STrueblood@skandia.com]
Sent: Tuesday, June 26, 2001 8:40 AM
To: 'support@concord.com'
Subject: Ticket# 50625

Here is the information requested by Russ Rick when the machine crashed. <> Scott Trueblood, Skandia Technology Center Inc., Global Network Operations Center, (203) 925-6936, P.O. Box 883, Shelton, CT 06484-0883

6/26/2001 11:41:14 AM rrick
-----Original Message-----
From: Rick, Russell
Sent: Tuesday, June 26, 2001 11:30 AM
To: Zhang, Yulun
Subject: RE: prob.
15475

Hi Yulun, I have received the DbCollect.tar. It is in the //bafs/escalated tickets/50000/50625 directory. Please let me know if you need any help. Regards, Russell K. Rick, Senior Support Engineer

6/27/2001 10:32:16 AM rrick Customer transaction log was already set at 1000 MB.

6/27/2001 6:17:51 PM yzhang Russell, it looks like E_CL0606 and E_DMA44F in the errlog.log may cause the problem of writing to the transaction log. Can you check whether the customer is running backup software? Also, are you sure the customer doesn't have a disk problem? Thanks, Yulun

6/28/2001 10:22:43 AM rrick NILM with Scott: please call back.

6/29/2001 12:08:34 PM rrick
-----Original Message-----
From: Trueblood, Scott [mailto:STrueblood@skandia.com]
Sent: Friday, June 29, 2001 9:15 AM
To: 'Rick, Russell'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475
Importance: High

Any ideas? This is killing me; the database crashes at least every day. -SBT

-----Original Message-----
From: Rick, Russell
Sent: Friday, June 29, 2001 11:57 AM
To: 'Trueblood, Scott'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475

Hi Scott, please confirm the following: Are you running any third-party backup package in-house on this disk? Regards, Russell K. Rick, Senior Support Engineer

6/29/2001 12:57:06 PM rrick
-----Original Message-----
From: Trueblood, Scott [mailto:STrueblood@skandia.com]
Sent: Friday, June 29, 2001 12:10 PM
To: 'Rick, Russell'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475

Yes, we are using Backup Exec.
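The fix that emerges in this ticket is to keep the third-party backup away from the live Ingres files under $NH_HOME/idb and back up only the nhSaveDb output. A hedged local simulation of that policy, with invented directory contents standing in for a real install:

```shell
# Simulated $NH_HOME layout; on a real system, nhSaveDb would produce the
# save.tdb contents, and idb holds the live Ingres files.
NH_SIM=$(mktemp -d)
mkdir -p "$NH_SIM/nethealth/idb" "$NH_SIM/nethealth/db/save.tdb"
echo saved > "$NH_SIM/nethealth/db/save.tdb/nh_elements"
echo live  > "$NH_SIM/nethealth/idb/iirelation"
# Back up everything except the live Ingres files under idb, which a
# third-party backup must never touch while Ingres is running.
tar -C "$NH_SIM" --exclude='nethealth/idb' -cf "$NH_SIM/backup.tar" nethealth
tar -tf "$NH_SIM/backup.tar"
```

The same exclude pattern (or the equivalent exclusion list in the backup product) protects the live database while still capturing the save output.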
-----Original Message-----
From: Rick, Russell
Sent: Friday, June 29, 2001 12:14 PM
To: 'Trueblood, Scott'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475

Do you know the vendor name of the backup software? Regards, Russell K. Rick, Senior Support Engineer

6/29/2001 3:31:23 PM rrick
-----Original Message-----
From: Rick, Russell
Sent: Friday, June 29, 2001 3:20 PM
To: 'st@skandia.com'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475

Hi Scott. Comments: Since you have a third-party backup product, it is recommended to back up your Nethealth database only using nhSaveDb. Then you can run the backup utility against that *.tdb file. NOTE: PLEASE DO NOT RUN THIS THIRD PARTY BACKUP PRODUCT AGAINST THE $NH_HOME/IDB DIRECTORY. IT LOCKS UP THE INGRES PROCESSES. We just had a meeting and I brought up your ticket. One colleague said that he had seen this issue before and it turned out to be the backup utility locking up the Ingres DB. How often do you back up the "idb" directory? Regards, Russell K. Rick, Senior Support Engineer

7/3/2001 10:24:45 AM rrick
-----Original Message-----
From: Trueblood, Scott [mailto:STrueblood@skandia.com]
Sent: Monday, July 02, 2001 2:38 PM
To: 'Rick, Russell'
Subject: RE: Call Ticket #50625 & Problem Ticket #15475

Russ, the backup software is Backup Exec from Veritas. I have disabled the Backup Exec agent. I will let you know in a week. If I haven't crashed by then, I would think the problem has been resolved. -SBT

7/23/2001 10:38:50 AM yzhang Problem solved.

6/18/2001 3:30:06 PM jnormandin Statistics rollup failure due to non-recoverable DMT_SHOW error: Begin processing (06/17/2001 20:00:36).
Error: Sql Error occurred during operation (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Sun Jun 17 22:01:54 2001)

6/18/2001 4:37:41 PM yzhang Jason, their syslog shows a lot of messages regarding limited resources. Their db is about 8 GB, and only 0.3 GB is available, which is definitely not enough for carrying out our processes, including stats rollup. As we discussed over the phone, they need to add disk space up to 8 GB, then run stats rollup. Also check how long they keep the stats0 data; they may consider setting it to the default (3 days), then running stats rollup, which will reduce the size of the database a lot. Thanks, Yulun

6/18/2001 5:37:13 PM cpaschal
From: Paschal, Christine
Sent: Monday, June 18, 2001 5:27 PM
To: Zhang, Yulun
Subject: 15488 / 50958 - Statistics Rollup failure due to non-recoverable DMT_SHOW error.
Importance: High

Hi Yulun, Blake sent in the attached file as the output of the script Jason put onto ftp/outgoing. He said the script aborted after the following sql error, but that the file was created anyway. He believes the script is incomplete. e_lq0059 unable to start up fetch csr command unexpected initial protocol response. Thanks, Chris

6/18/2001 5:52:54 PM yzhang echo "select a.table_name, a.num_rows, a.number_pages, a.overflow_pages, a.table_pagesize, b.file_name, file_ext from iitables a, iifile_info b where a.table_name = b.table_name\g" | sql nethealth > iitables.out

6/19/2001 8:57:05 AM yzhang Have the customer remove any files they don't use from the nethealth partition, such as under $NH_HOME/tmp and $NH_HOME/db/save, and destroy any unused databases located in $II_SYSTEM/ingres/data/default. Then change the stats rollup settings to 2 (as polled), 6 (1-hour sample), 70 (1-week sample) (this is the default setting) from console/setup. They can run stats rollup after doing these. Thanks, Yulun

6/20/2001 5:27:47 PM yzhang What is in the directory is the nhCollectCustData posted on 6/18,
but the errlog indicated that their system catalogs might have been corrupted. Can you get the new errlog.log, the new stats rollup log, and sysmod.log (if they don't have it, run sysmod and redirect it into a file)? Thanks, Yulun

7/22/2001 11:55:33 AM yzhang Problem solved.

6/20/2001 11:07:24 AM foconnor Database is experiencing frequent inconsistencies. We have just corrected a database that was experiencing DMT_SHOW errors with the nhSaveDb, nhDestroyDb, nhCreateDb, and nhLoadDb sequence, but the customer says he has experienced this several times and wants to know why. I have his errlog.log file at \\BAFS\tickets\50000\50463\DbCollect.tar\IngresFileLogs. If you have the time, can you review that file and comment? Yulun/Robin, after reviewing the errlog.log, said to bug it. Collecting sysmod nethealth > sysmod.out output. Apparently the database catalogs keep getting corrupted.

6/21/2001 2:39:04 PM yzhang Need system.log.

6/21/2001 2:45:14 PM foconnor \\BAFS\tickets\50000\50463\DbCollect.tar\syslog.log; emailed Yulun the system.log.

6/21/2001 3:18:52 PM yzhang You might want to talk to Jason about replacing the iiattribute physical file for the problem database. Currently, we have no other good option besides destroy, create, and reload. Thanks, Yulun

8/10/2001 8:36:43 AM mmcnally
-----Original Message-----
From: McNally, Mike
Sent: Friday, August 10, 2001 8:23 AM
To: Zhang, Yulun
Subject: PT 15541 "Database is frequently inconsistent"

Yulun, Ticket 50463 has been closed due to the lack of response from the customer. The associated problem ticket can also be closed. Thanks, Mike

6/22/2001 4:06:27 PM rrick Customer ran an nhSaveDb in ASCII format and it failed on Tuesday with the following message. In his save.log he gets the original error: "Fatal Internal Error: Ok. (none/)" After that, the system seemed to have stopped polling until today, when it started to poll on its own again. VERY STRANGE!
I have retrieved an nhCollectCustData, located in //BASF/Escalated Tickets/50000/50910.

6/26/2001 5:43:53 PM yzhang Reviewed the problem; talked to support about the following:
1) have the customer keep their regular dbsave backup
2) run nhRemoteSaveDb -a ..... to see if the stats0 can be saved in ASCII format
3) create a new database as nethealth_t
4) load the regular save into nethealth_t, then save nethealth_t into ASCII

6/28/2001 1:45:32 PM don This worked.

6/25/2001 10:03:08 AM rrick Initial issue: Missing two polling cycles because of an XLIB error in the syslog.log.
RESEARCH: In the maintenance reports on Friday 6/22/01 the following messages occurred:
----- Job started by Scheduler at '06/22/2001 19:30:06'. -----
----- $NH_HOME/bin/nhReset -----
Stopping Network Health servers.
Network Health processes are not running
Starting Network Health servers.
Error: Unable to send message to another process - you may need to restart the Network Health server (Broken pipe).
Jim, I am not sure what caused this Broken pipe error, but this definitely looks like the reason why this happened. I am also seeing some strange QEF and session errors in the $NH_HOME/idb/ingres/files/errlog.log file that also occurred on Friday and, I think, have something to do with causing this issue. These are internal Ingres errors, which I am going to have to wait until Monday morning to discuss with one of our Ingres engineers. With further research into this issue, I found that, especially between 5am and 8am on Sat. 6/23/01, a tremendous number of Health and Top_N reports were run against the data in your system. Some of the report logs reported database EXLOCK errors and "database did not exist" errors. My theory right now is that too many reports are being run in such a short period of time that we have a report-overlap issue. This could cause the next report or process due to run to be locked out of the database. The nhiConsole process could have been the final straw.
So the Nethealth system recycled itself, causing 2 poll periods to be missed while the system came down and back up again. All information exists on //bafs/escalated tickets/51000/51197.

10/1/2001 10:56:53 AM yzhang Walter, can you check with the customer to see if they still have the problem, and write a detailed description of the current problem they have? Thanks, Yulun

10/1/2001 11:01:14 AM wburke
-----Original Message-----
From: Jim Maynard [mailto:jim.maynard@wcom.com]
Sent: Wednesday, June 27, 2001 4:24 PM
To: 'Burke, Walter'
Subject: RE: Ticket # 51197

Not seeing any xlib or fork errors... seeing some reports fail... but no errors. Thanks!

10/2/2001 10:46:43 AM beta program Updating status per more info received from customer.

10/10/2001 11:30:42 AM yzhang Walter, based on your update, the customer no longer has the xlib error, so this problem ticket will be closed. You need to have the customer create another call ticket for their current report problem. Thanks, Yulun

6/25/2001 2:11:47 PM wburke Nethealth servers stopped due to an Ingres stack dump.
- Spoke w/ Robin.
- Needs to be opened w/ CA.
- Obtain new CollectCustData; older files on BAFS/40929/6-20-01

6/25/2001 4:22:48 PM wburke Obtained newest nhCollectCustData: BAFS/50929/6-25-01

7/25/2001 3:19:22 PM schapman The newest version of nhCollectCustData is on BAFS/50929/july25.

7/26/2001 4:05:50 PM lemmon Yulun, please update the status of this ticket.

7/27/2001 6:11:59 PM yzhang The customer is now up and running. Asked support:
1) to give detailed steps for collecting the core file
2) to check what ingres and nhi processes are running at the time of the next crash
3) to check if there are new entries in errlog.log at the time of the next crash
4) at the same time, I will create an issue with CA

8/21/2001 6:36:42 PM rkeville It appears that the stack dump is occurring during a select statement and involves a segmentation violation. This is on WinNT.
- Segmentation Violation (SIGSEGV) gforce_page(0x30a1d0) @ PC 30a6f8 SP fd60e0d0 PSR fe401004 G1 1 o0 84dd08
- GRAPEAPE::[33027, 0000092a]: Query: select count(*) from iicolumns where table_name='nh_element' and column_name='ip_address'
This is occurring at another site as well; however, there is no segmentation violation there. That system is on Solaris. Ticket 53133:
- 00000275 General Protection Exception @7025fac3 SP:43eaceb0 BP:43ead084 AX:0 CX:3923057d DX:be3026c BX:1859ae0 SI:1353900 DI:1346c1c
- 00000275 Tue Jun 12 06:46:09 2001 E_DM9049_UNKNOWN_EXCEPTION An Unexpected Exception occurred in the DMF Facility, exception number 68197.
- KINGKONG::[II\INGRES\1d6, 00000275]: Query: select count(*) from iicolumns where table_name='nh_element' and column_name='ip_address'
NOTE: Could this be related to CA bug number 94474? "Using multiple global temporary tables in an abf procedure which loops more than 64K times in a session causes error E_US1263. If more sessions are run, then the total number of iterations for all sessions must be > 64K. For this problem to be seen, the code must perform at least 1 commit after the temp tables have been created and also include code that makes use of internal temporary tables."

8/29/2001 9:57:32 AM rkeville
-----Original Message-----
From: KWOK.LEE@chase.com [mailto:KWOK.LEE@chase.com]
Sent: Wednesday, August 29, 2001 9:34 AM
To: support; Support List (E-mail)
Subject: RE: Call Ticket 50929 Nethealth Servers stopped due to ingres Stack Dump

Sheldon, Nethealth stopped twice in the past 24 hours on the kingkong1 NT server. I generated the following logs for your review. Pls advise. Thanks, Kwok

Files are on BAFS.
9/4/2001 2:21:21 PM yzhang Can you get the following files for me to look at your problem:
$II_SYSTEM/ingres/files/config.dat
$II_SYSTEM/ingres/files/symbol.tbl
Thanks, Yulun

9/5/2001 3:01:01 PM yzhang Sheldon, this customer needs to do the following two things:
1) The customer has stack size 131072; they need to double this number. To do it:
login as ingres
source net*.csh
cbf
DBMS Server, F1 Config
find stack_size under the name column, and double the figure through the Edit option
2) Turn the group buffer off through the following steps:
login as ingres
source net*.csh
cbf
DBMS Server, F1 Config
leave the highlight where it is, F1 Cache
highlight DMF Cache 2k, F1, type Con to set dmf_group_size 0, F1, End
highlight DMF Cache 4k, F1, type Con to set dmf_group_size 0, F1, End
highlight DMF Cache 8k, F1, type Con to set dmf_group_size 0, F1, End
After they have done the above, stop and restart Ingres. You might want to practice this before instructing the customer. Let me know if you have questions.

9/6/2001 12:07:02 PM schapman Customer has implemented the configuration changes per Yulun, and I will check with him after the weekend to see if a stack dump has occurred.

9/10/2001 5:09:43 PM schapman The problem has not reoccurred. I will be de-escalating this issue and monitoring it until the end of the week.

9/11/2001 10:53:06 AM rkeville Requested information from customer.

9/11/2001 2:11:33 PM yzhang Frank, here are the symbol.tbl and config.dat you requested. I think I already sent you these, but I received them from our support today. As to the effect of increasing the stack size and turning the group buffer off on Ingres performance, I am still waiting to hear from our customer; will let you know soon.
Thanks Yulun
9/18/2001 3:59:26 PM schapman
-----Original Message-----
From: KWOK.LEE@chase.com [mailto:KWOK.LEE@chase.com]
Sent: Tuesday, September 18, 2001 11:37 AM
To: support; Support List (E-mail)
Subject: Re: Call Ticket 50929 Nethealth Servers stopped due to ingres Stack Dump
Sheldon, The same problem did not reoccur over the past few days. I agree we should close this ticket. Thanks, Kwok
9/18/2001 4:01:43 PM yzhang
It looks like the change in settings has resolved this issue. Thanks for your assistance.
9/19/2001 9:30:15 AM dbrooks
Issue resolved. See above note from customer.
6/26/2001 1:19:52 PM foconnor
The data analysis failed due to table nh_hourly_health being a heap and not a btree. I have sent you a dump of the table structures.
6/26/2001 1:34:17 PM yzhang
Will write a script.
7/9/2001 11:00:19 AM rrick
---------------------- Forwarded by Daniel Annand/Sydney/Com Tech/AU on 09/07/2001 12:01 PM ---------------------------
Daniel Annand/Sydney 09/07/2001 12:01 PM
To: "O'Connor, Farrell"
cc:
Subject: Re: Call ticket 49018 (Document link: Daniel Annand)
Hi, here are the outputs (See attached file: lsla.txt)(See attached file: indexDiag.out)(See attached file: dk-k.txt) Regards Dan
-----Original Message-----
From: Rick, Russell
Sent: Monday, July 09, 2001 10:45 AM
To: Zhang, Yulun
Subject: FW: Call ticket 49018
What do you recommend? Regards, Russell K. Rick, Senior Support Engineer
7/9/2001 12:44:44 PM yzhang
My recommendation is to get enough disk space, then run nhiIndexDb. They have about 20 tables that have not been indexed properly. Before doing these you need to work with the customer to make sure their database is consistent. Thanks Yulun
7/9/2001 5:31:05 PM rrick
-----Original Message-----
From: Rick, Russell
Sent: Monday, July 09, 2001 10:50 AM
To: Rick, Russell; Zhang, Yulun
Subject: RE: Call ticket 49018
Yulun, One of their disks is at 95%. Regards, Russell K.
Rick, Senior Support Engineer
-----Original Message-----
From: Zhang, Yulun
Sent: Monday, July 09, 2001 12:36 PM
To: Rick, Russell
Subject: RE: Call ticket 49018
My recommendation is to get enough disk space, then run nhiIndexDb. They have about 20 tables that have not been indexed properly. Before doing these you need to work with the customer to make sure their database is consistent. Thanks Yulun
-----Original Message-----
From: Rick, Russell
Sent: Monday, July 09, 2001 5:18 PM
To: 'dannand@comtech.com.au'
Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Daniel,
Notes: This Call Ticket has been put into an Escalated Status. I am the Senior Support Engineer assigned to your Escalated Call Ticket.
Issue: The data analysis failed due to table nh_hourly_health being a heap and not a btree.
Comments: My recommendation is to free up some disk space. Currently, your disk is 95% full. This will prevent Ingres from indexing all of its tables. Then please re-run the following commands at the command line:
$NH_HOME/bin/sys/nhiIndexDb -u nhuser -d nethealth
$NH_HOME/bin/sys/nhiIndexDiag -u nhuser -d nethealth > indexDiag.out
Please forward the indexDiag.out file to support@concord.com, Attn: Russ Rick.
Regards, Russell K. Rick, Senior Support Engineer
7/11/2001 11:08:50 AM rrick
-----Original Message-----
From: dannand@comtech.com.au [mailto:dannand@comtech.com.au]
Sent: Monday, July 09, 2001 7:45 PM
To: Rick, Russell
Cc: ttagg@comtech.com.au; cpagesinclair@comtech.com.au; nmarch@comtech.com.au
Subject: Re: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Russell,
1/ Do I have to stop the NH server or poller to run nhiIndexDb?
2/ The disk that is 95% full is the backup disk; it is ONLY used for ufs dumps. This does not impact NetHealth. The disk could be unmounted and the only effect will be the failure of ufsdumps.
The mount points are as follows:
/u 59% full - contains $NH_HOME
/u2 24% full - contains logs
/u3 27% full - contains idb
/u4 50% full - contains the NetHealth database saves.
So I don't believe space is an issue. From a nhDbStatus:
Database size is 3.7Gb
Database free space is 11.9Gb
Thanks dan
-----Original Message-----
From: Rick, Russell
Sent: Wednesday, July 11, 2001 10:56 AM
To: Zhang, Yulun
Subject: FW: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Yulun, Since this customer does not have a space problem after all, would running the re-indexing program cause any problems? Also, I assume that I only have to have them take down the NH server and not the Ingres server? Is this correct? Regards, Russell K. Rick, Senior Support Engineer
7/11/2001 11:22:20 AM rrick
-----Original Message-----
From: Zhang, Yulun
Sent: Wednesday, July 11, 2001 11:02 AM
To: Rick, Russell
Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Yes, stop nhServer and run nhiIndexDb.
-----Original Message-----
From: Rick, Russell
Sent: Wednesday, July 11, 2001 11:10 AM
To: 'dannand@comtech.com.au'
Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Dan, Yes, please stop the NH server before running the nhiIndexDb program. Now that you have told me about the space on your drives, I agree: you do not have a space issue. Sorry about that. Regards, - Russ
7/13/2001 3:27:53 PM rrick
-----Original Message-----
From: dannand@comtech.com.au [mailto:dannand@comtech.com.au]
Sent: Thursday, July 12, 2001 5:40 AM
To: Rick, Russell
Cc: ttagg@comtech.com.au
Subject: Re: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Rick, The nhiIndexDiag was unsuccessful due to the nh_elem_outage table containing duplicates.
Attached is the output of nhiIndexDiag (See attached file: indexDiag.out.12jul01) Thanks Dan
"Rick, Russell" on 10/07/2001 07:17:46 AM
To: "'dannand@comtech.com.au'"
cc:
Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Daniel,
Notes: This Call Ticket has been put into an Escalated Status. I am the Senior Support Engineer assigned to your Escalated Call Ticket.
Issue: The data analysis failed due to table nh_hourly_health being a heap and not a btree.
Comments: My recommendation is to free up some disk space. Currently, your disk is 95% full. This will prevent Ingres from indexing all of its tables. Then please re-run the following commands at the command line:
$NH_HOME/bin/sys/nhiIndexDb -u nhuser -d nethealth
$NH_HOME/bin/sys/nhiIndexDiag -u nhuser -d nethealth > indexDiag.out
Please forward the indexDiag.out file to support@concord.com, Attn: Russ Rick.
Regards, Russell K. Rick, Senior Support Engineer
Domestic and International Customers Group
Concord Communications, Inc. http://www.concord.com
600 Nickerson Road Marlboro, Ma. 01752 USA
Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300
Technical Support hours are 5:00 AM to 8:00 PM EST, Monday through Friday. Please address all responses to support@concord.com and include your assigned call ticket number in the subject field.
This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This footnote also confirms that this email message has been swept by the latest virus scan software available for the presence of computer viruses.
- NOTICE - This message is confidential, and may contain proprietary or legally privileged information. If you have received this email in error, please notify the sender and delete it immediately. Internet communications are not secure.
You should scan this message and any attachments for viruses. Under no circumstances do we accept liability for any loss or damage which may result from your receipt of this message or any attachments.
7/13/2001 3:36:59 PM rrick
Yulun: I put the results out to the escalated tickets directory as well.
7/13/2001 3:54:43 PM yzhang
Have them do:
echo "modify nh_elem_outage to btree unique on element_id, time_down, time_polling\g" | sql nethealth > outage.out
7/16/2001 4:58:08 PM rrick
-----Original Message-----
From: Zhang, Yulun
Sent: Friday, July 13, 2001 3:46 PM
To: Rick, Russell
Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Have them do:
echo "modify nh_elem_outage to btree unique on element_id, time_down, time_polling\g" | sql nethealth > outage.out
-----Original Message-----
From: Rick, Russell
Sent: Monday, July 16, 2001 4:45 PM
To: 'dannand@comtech.com.au'
Subject: Re: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Dan, Please execute the following command line from the $NH_HOME directory and forward the outage.out file to support@concord.com, Attn: Russ Rick:
echo "modify nh_elem_outage to btree unique on element_id, time_down, time_polling\g" | sql nethealth > outage.out
Regards, Russell K. Rick, Senior Support Engineer
7/30/2001 1:37:18 PM rrick
-----Original Message-----
From: Rick, Russell
Sent: Monday, July 30, 2001 1:24 PM
To: 'dannand@comtech.com.au'
Subject: Re: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Dan, Were you ever able to acquire this information for me? Regards, Russell K. Rick, Senior Support Engineer
7/30/2001 7:54:41 PM rrick
To: "Rick, Russell"
cc:
Subject: Re: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018 (Document link: Daniel Annand)
Sorry for the slow response back. The command you sent me below was unsuccessful due to duplicate rows.
Regards Dan
7/31/2001 1:28:57 PM rrick
-----Original Message-----
From: daniel.annand@didata.com.au [mailto:daniel.annand@didata.com.au]
Sent: Monday, July 30, 2001 7:59 PM
To: support@concord.com
Subject: ATTN: Rick Russell ---- RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
(See attached file: outage.out)
---------------------- Forwarded by Daniel Annand/Sydney/Com Tech/AU on 31/07/2001 09:58 AM ---------------------------
Yulun, File requested in //bafs/49000/49018.
7/31/2001 5:59:40 PM rrick
-----Original Message-----
From: Rick, Russell
Sent: Tuesday, July 31, 2001 4:16 PM
To: 'daniel.annand@didata.com.au'; 'tommy.tagg@didata.com.au'; 'matt.cudworth@didata.com.au'
Cc: Bailey, Tom
Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018
Hi Dan,
1. The following is the Database Troubleshooting Guide. I thought this might help you out:
2. In order for us to go forward, we need to clean up the duplicates. The following script will help do that. Please perform the following:
a. Please download or FTP the attached "cleanStats" script to the $NH_HOME directory. NOTE: If you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format. NOTE: This file is NOT in zipped-up or compressed format.
b. Log in to the Network Health System as the nethealth user.
c. Take down the Network Health console and server.
d. source nethealthrc.csh
e. Please execute the cleanStats script from the $NH_HOME directory:
1. To just get information regarding duplicate problems, execute the following command from the command line: ./cleanStats > dupInfo.out
2. To also remove the duplicates, execute the following command from the command line: ./cleanStats clean > dedupinfo.out
f. Please send me the output file "dupInfo.out".
g. Bring up the Network Health server and console.
h.
Execute a manual Statistical Index from the command line in the $NH_HOME/bin/sys directory: nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME
i. Please try to re-run the data analysis. Did it work?
If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, e.s.t. Regards, Russell K. Rick, Senior Support Engineer
8/6/2001 1:41:35 PM rrick
From: daniel.annand@didata.com.au [mailto:daniel.annand@didata.com.au]
Sent: Sunday, August 05, 2001 6:47 PM
To: support@concord.com
Subject: Attn: Russ Rick --- attempt at DataAnalysis fix.
Another delivery error - please let me know if you receive this. Regards Dan
---------------------- Forwarded by Daniel Annand/Sydney/Com Tech/AU on 06/08/2001 08:46 AM ---------------------------
Daniel Annand/Sydney 05/08/2001 05:28 PM
To: Support@concord.com
cc:
Subject: Attn: Russ Rick --- attempt at DataAnalysis fix.
I received a delivery error on the first attempt so here it is again.
---------------------- Forwarded by Daniel Annand/Sydney/Com Tech/AU on 05/08/2001 05:27 PM ---------------------------
Daniel Annand/Sydney 05/08/2001 05:26 PM
To: support@concord.com
cc: tommy.tagg@didata.com.au, Matt Cudworth/Sydney/Com Tech/AU@Com Tech, Neville March/Sydney/Com Tech/AU@Com Tech
Subject: Attn: Russ Rick --- attempt at DataAnalysis fix.
Hi Russ, I tried the cleanStats.sh script as you suggested and nothing happened. I copied the script to $NH_HOME, changed it to UNIX format, set permissions and ran it as suggested.
The dupInfo.out file was empty, as was the dedupInfo.out file; they are not attached to this email. My opinion is that the $files variable is not being populated. I remember when Farrell sent me the exact same script and had a similar problem on this particular server.
Question - What is held in the nh_elem_outage table? Is it important information? Does it relate to scheduled outages, or is it elements that have been unavailable during a polling period? The reason I ask is we do not implement scheduled outages, so I would rather delete the table than spend another month trying to remove duplicates.
This is the output created by the cleanStats.sh script that you sent me - concordClean.out - as you can see, the script needs back-ticks around the command 'date' in your original script; other than that there was no output (I ran the script 4 times, hence the contents of the file). From the script it should contain the contents of $tables. I have no idea how the $tables variable is to be populated (the sql is too convoluted for me!) but it does not seem to be happening. Also, some of the echo statements were missing quotes in the main function.
trend$ cat concordClean.out
date
Sunday August 5 15:40:32 EST 2001
Sunday August 5 15:59:59 EST 2001
Sunday August 5 16:00:55 EST 2001
trend$
This is the output of running nhiIndexStats and nhiDataAnalysis
trend$ nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME
Begin processing (05/08/2001 16:16:39).
trend$
trend$ nhiDataAnalysis
Begin processing (05/08/2001 16:17:12).
Error: Unable to execute 'MODIFY nh_elem_outage TO MERGE' (E_US1595 MODIFY: nh_elem_outage: table is not a btree; only a btree table can be modified to merge. (Sun Aug 5 02:23:10 2001) ).
trend$
This is the output of nhiIndexDiag
Problem encountered with analyzing table
Error: Indexing problem: nh_elem_outage should have been btree but was HEAP. Table is lacking an index.
Duplicate problem: Found 0 duplicates out of 558 rows for index job_schedule_ix on table nh_job_schedule.
Analysis of indexes on database 'nethealth' for user 'emc' completed successfully.
This is the output of nhiIndexDb
trend$ nhiIndexDb -u emc -d nethealth
Creating the Table Structures and Indices . . .
Non-Fatal database error on object: NH_ELEM_OUTAGE
05-Aug-2001 16:39:13 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Sun Aug 5 02:39:12 2001)
Creating the Table Structures and Indices for sample tables . . .
Index of database 'nethealth' for user 'emc' was unsuccessful.
Your suggestion:
1. The following is the Database Troubleshooting Guide. I thought this might help you out: <>
2. In order for us to go forward, we need to clean up the duplicates. The following script will help do that. Please perform the following:
a. Please download or FTP the attached "cleanStats" script to the $NH_HOME directory. <> NOTE: If you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format. NOTE: This file is NOT in zipped-up or compressed format.
b. Log in to the Network Health System as the nethealth user.
c. Take down the Network Health console and server.
d. source nethealthrc.csh
e. Please execute the cleanStats script from the $NH_HOME directory:
1. To just get information regarding duplicate problems, execute the following command from the command line: ./cleanStats > dupInfo.out
2. To also remove the duplicates, execute the following command from the command line: ./cleanStats clean > dedupinfo.out
f. Please send me the output file "dupInfo.out".
g. Bring up the Network Health server and console.
h. Execute a manual Statistical Index from the command line in the $NH_HOME/bin/sys directory: nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME
i. Please try to re-run the data analysis. Did it work?
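The back-tick problem Dan identifies in the cleanStats script can be reproduced in isolation. Without command substitution, the shell writes the literal word instead of running the command (file names below are illustrative):

```shell
# Minimal reproduction of the cleanStats bug: `date` without
# back-ticks writes the string "date" to the log instead of
# the command's output.
echo date > broken.log           # bug: literal word "date"
echo "`date +%Y`" > fixed.log    # fix: back-ticks run the command
cat broken.log                   # prints: date
```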
-----Original Message-----
From: daniel.annand@didata.com.au [mailto:daniel.annand@didata.com.au]
Sent: Sunday, August 05, 2001 11:59 PM
To: support@concord.com
Cc: tommy.tagg@didata.com.au
Subject: ATTN: Russ Rick
Hi Russ, Last we spoke you asked for the person(s) involved in modifying our poller. The Concord call ticket number assigned to the problem was 36092. From the email I have been given, it looks like Bob Keville was assigned to the ticket. This modification to the poller is a main reason we have not applied patches, since no-one really knows if the poller will be affected. A confirmation from Concord whether a patch upgrade will produce desired or undesired results would be beneficial. Regards Dan
Com Tech Communications is now trading as Dimension Data Australia
8/6/2001 1:48:41 PM yzhang
Russell, What exactly is the customer's problem?
Don't run cleanstats.sh; that script may not solve the problem. Yulun
8/6/2001 2:03:23 PM yzhang
Grab and have them run index_elem_outage.sh from ~yzhang/scripts, just typing the script name, after sourcing nethealthrc.csh: index_elem_outage.sh > index_elem_outage. Then run nhiIndexDiag and nhiIndexDb.
8/6/2001 3:13:27 PM rrick
From: Rick, Russell
Sent: Monday, August 06, 2001 1:45 PM
To: Zhang, Yulun
Subject: RE: 15757
This is the output of nhiIndexDiag
Problem encountered with analyzing table
Error: Indexing problem: nh_elem_outage should have been btree but was HEAP. Table is lacking an index.
Duplicate problem: Found 0 duplicates out of 558 rows for index job_schedule_ix on table nh_job_schedule.
Analysis of indexes on database 'nethealth' for user 'emc' completed successfully.
This is the output of nhiIndexDb
trend$ nhiIndexDb -u emc -d nethealth
Creating the Table Structures and Indices . . .
Non-Fatal database error on object: NH_ELEM_OUTAGE
05-Aug-2001 16:39:13 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Sun Aug 5 02:39:12 2001)
Creating the Table Structures and Indices for sample tables . . .
Index of database 'nethealth' for user 'emc' was unsuccessful.
If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, e.s.t. Regards, Russell K. Rick, Senior Support Engineer
From: Zhang, Yulun
Sent: Monday, August 06, 2001 1:54 PM
To: Rick, Russell
Subject: RE: 15757
Grab and have them run index_elem_outage.sh from ~yzhang/scripts, just typing the script name, after sourcing nethealthrc.csh: index_elem_outage.sh > index_elem_outage. Then run nhiIndexDiag and nhiIndexDb.
From: Rick, Russell
Sent: Monday, August 06, 2001 2:37 PM
To: Zhang, Yulun
Subject: RE: 15757
Is this located on /home/eng/yzhang? If so, where? If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick.
My support hours are 11:30am - 8:00pm, e.s.t. Regards, Russell K. Rick, Senior Support Engineer
From: Zhang, Yulun
Sent: Monday, August 06, 2001 2:42 PM
To: Rick, Russell
Subject: RE: 15757
in /home/eng/yzhang/scripts
From: Rick, Russell
Sent: Monday, August 06, 2001 2:57 PM
To: 'dannand@comtech.com.au'
Cc: Gray, Don; Bailey, Tom
Subject: FW: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018 & Problem Ticket #15757
Hi Daniel,
Comments: We need to acquire some additional information regarding this issue. Please follow the instructions below:
Instructions: Please perform the following:
1. Log in to the Network Health System as the nethealth user.
2. Please download or FTP the attached "index_elem_outage.sh" script to the $NH_HOME/bin directory. NOTE: If you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format. NOTE: This file is NOT in zipped-up or compressed format.
3. Please execute the following from the $NH_HOME directory: source nethealthrc.csh
4. Please execute this script from the $NH_HOME/bin directory: index_elem_outage.sh > index_elem_outage
5. Then, please execute the nhiIndexDiag script as you were before.
6. Also, please execute the nhiIndexDb, as you were before.
7. Please forward the output to support@concord.com, Attn: Russ Rick.
If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, est. Regards, Russell K.
Rick, Senior Support Engineer
8/8/2001 1:08:53 PM rrick
From: daniel.annand@didata.com.au [mailto:daniel.annand@didata.com.au]
Sent: Wednesday, August 08, 2001 3:24 AM
To: support@concord.com
Cc: tommy.tagg@didata.com.au; matt.cudworth@didata.com.au
Subject: RE: Attn: Russ Rick Re: RE: Call Ticket #49018 & Problem Ticket #15757
The output of your script is attached. It looks like it didn't index due to duplicate rows, so I didn't do the nhiIndexDb etc. (See attached file: index_elem_outage.out) Regards Dan
From: Rick, Russell
Sent: Wednesday, August 08, 2001 12:54 PM
To: Zhang, Yulun
Subject: FW: Attn: Russ Rick Re: RE: Call Ticket #49018 & Problem Ticket #15757
Yulun, The script did not work. Is there any other way to remove these duplicates? If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, e.s.t. Regards, Russell K. Rick, Senior Support Engineer
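The "duplicate rows" failure the thread keeps hitting can be illustrated outside of Ingres. A minimal sketch of counting duplicate key rows, in the style of nhiIndexDiag's "Found N duplicates out of M rows" report (the pipe-separated file and its values are fabricated stand-ins for (element_id, time_down, time_polling) rows of nh_elem_outage):

```shell
# Stand-in data: row 1 and row 2 share the same key triple.
cat > outage.txt <<'EOF'
101|9960001|9960300
101|9960001|9960300
102|9960050|9960350
EOF

# Total rows vs. distinct rows; the difference is the duplicate count
# that blocks building a unique btree over these columns.
total=$(wc -l < outage.txt | tr -d ' ')
unique=$(sort -u outage.txt | wc -l | tr -d ' ')
echo "Found $((total - unique)) duplicates out of $total rows"
# prints: Found 1 duplicates out of 3 rows
```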
8/9/2001 1:16:37 PM yzhang
Follow the example in ~yzhang/scripts/claenElemAddrDup.sh to come up with a script to clean the duplicates in nh_elem_outage. Note: there is only a primary index for nh_elem_outage; it is "modify nh_elem_outage to btree unique on element_id, time_down, time_polling". There is no secondary index for nh_elem_outage. I would like to review the script before you ship it out. Yulun
8/9/2001 4:19:42 PM rrick
From: Zhang, Yulun
Sent: Thursday, August 09, 2001 1:07 PM
To: Rick, Russell
Subject: RE: Attn: Russ Rick Re: RE: Call Ticket #49018 & Problem Ticket #15757
Follow the example in ~yzhang/scripts/claenElemAddrDup.sh to come up with a script to clean the duplicates in nh_elem_outage. Note: there is only a primary index for nh_elem_outage; it is "modify nh_elem_outage to btree unique on element_id, time_down, time_polling". There is no secondary index for nh_elem_outage. I would like to review the script before you ship it out. Yulun
From: Rick, Russell
Sent: Thursday, August 09, 2001 4:05 PM
To: Zhang, Yulun
Subject: RE: Attn: Russ Rick Re: RE: Call Ticket #49018 & Problem Ticket #15757
Yulun, There is no file ~yzhang/scripts/claenElemAddrDup.sh. I did find a file with the name ~yzhang/scripts/cleanElemAddrDup_12022.sh. It is pretty involved. Do I replace the names of the tables only? If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, e.s.t. Regards, Russell K. Rick, Senior Support Engineer
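A generic sketch of the cleanup pattern such a dedup script follows: copy distinct rows aside, then swap them in for the original, after which a unique btree on the key columns can be built. This is not Yulun's actual cleanElemOutageDup.sh (which works inside the database); the text-file stand-in and its values are illustrative only:

```shell
# Stand-in for nh_elem_outage rows (element_id|time_down|time_polling),
# with one duplicated key row.
cat > nh_elem_outage.txt <<'EOF'
7|100|105
7|100|105
8|200|210
EOF

# Dedup: keep one copy of each distinct row, then replace the original.
sort -u nh_elem_outage.txt > nh_elem_outage.dedup
mv nh_elem_outage.dedup nh_elem_outage.txt

rows=$(wc -l < nh_elem_outage.txt | tr -d ' ')
echo "$rows rows remain"   # prints: 2 rows remain
```

The real script must preserve any non-key columns of the surviving row, which is why Yulun wanted to review it before it shipped.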
8/9/2001 4:33:57 PM rrick
From: Zhang, Yulun
Sent: Thursday, August 09, 2001 4:17 PM
To: Rick, Russell
Subject: RE: Attn: Russ Rick Re: RE: Call Ticket #49018 & Problem Ticket #15757
I will write the script for you later.
8/13/2001 3:14:25 PM yzhang
Have the customer run ~yzhang/scripts/cleanElemOutageDup.sh just by typing the script name after sourcing nethealthrc.csh. After running the script, send me cleanElemOutageDup.out located in $NH_HOME/tmp. The script has been tested. Thanks Yulun
8/13/2001 4:03:20 PM rrick
From: Zhang, Yulun
Sent: Monday, August 13, 2001 3:05 PM
To: Rick, Russell
Subject: prob. 15757
Have the customer run ~yzhang/scripts/cleanElemOutageDup.sh just by typing the script name after sourcing nethealthrc.csh. After running the script, send me cleanElemOutageDup.out located in $NH_HOME/tmp. The script has been tested. Thanks Yulun
From: Rick, Russell
Sent: Monday, August 13, 2001 3:49 PM
To: 'dannand@comtech.com.au'
Cc: Gray, Don; Bailey, Tom
Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018 & Problem Ticket #15757
Hi Daniel,
Instructions: Please perform the following:
1. Log in to the Network Health System as the nethealth user.
2. Please download or FTP the attached "cleanElemOutageDup.sh" script to the $NH_HOME/bin directory. NOTE: If you are using FTP to download this file to a Unix box from a Windows NT box, please make sure to FTP in binary format. NOTE: This file is NOT in zipped-up or compressed format.
3. Please execute the following from the $NH_HOME directory: source nethealthrc.csh
4. Please execute this script from the $NH_HOME/bin directory by typing the following: cleanElemOutageDup.sh
5. Please forward the $NH_HOME/tmp/cleanElemOutageDup.out output to support@concord.com, Attn: Russ Rick.
If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, est. Regards, Russell K.
Rick, Senior Support Engineer
8/14/2001 11:19:29 AM yzhang
Have the customer do a current dbsave and post it on the ~ftp/incoming site, then let me know. Thanks Yulun
8/14/2001 1:43:45 PM rrick
From: Zhang, Yulun
Sent: Tuesday, August 14, 2001 9:57 AM
To: Rick, Russell
Subject: FW: prob. 15757
Russell, this is an ASCII file; use ASCII when FTPing.
From: Zhang, Yulun
Sent: Tuesday, August 14, 2001 11:09 AM
To: Rick, Russell
Subject: prob. 15757
Have the customer do a current dbsave and post it on the ~ftp/incoming site, then let me know. Thanks Yulun
From: Rick, Russell
Sent: Tuesday, August 14, 2001 1:30 PM
To: 'dannand@comtech.com.au'
Cc: Gray, Don; Bailey, Tom
Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #49018 & Problem Ticket #15757
Hi Daniel, Please also execute an nhSaveDb and post the *.tdb and nhSaveDb.log file to the ftp.concord.com/incoming directory. Please make a directory in the incoming directory named "Comtech". If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, e.s.t. Regards, Russell K. Rick, Senior Support Engineer
7/6/2001 3:00:41 PM wburke
Scheduled dbSave fails randomly, about once a week, on different days: 'D:/nethealth/db/save/Daily.tdb/nh_daily_symbol_1000007'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Tue Jun 19 22:26:23 2001)). (cdb/DuTable::saveTable)
- The next scheduled save runs fine.
- Manual dbSaves run fine.
- Table in question is not static - i.e., one failure is on 100007, the next on 100003, the next on nh_schedule_job, etc.
- sysmod of db and verify completed successfully.
- recreate and reload of db completed successfully.
- However, dbSave failure still crops up. Obtaining DB for in-house testing.
7/13/2001 11:31:59 AM wburke
----- Job started by Scheduler at '7/12/2001 10:00:35 PM'. -----
----- $NH_HOME/bin/sys/nhiSaveDb -u $NH_USER -d $NH_RDBMS_NAME -p D:/nethealth/db/save/Daily.tdb -----
Begin processing (7/12/2001 10:00:36 PM). Copying relevant files (7/12/2001 10:00:40 PM). Unloading the data into the files, in directory: 'D:/nethealth/db/save/Daily.tdb/'. . . Unloading table nh_active_alarm_history . . . Unloading table nh_active_exc_history . . . Unloading table nh_alarm_history . . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_alarm_subject_history . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_calendar . . . Unloading table nh_calendar_range . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table nh_exc_subject_history . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_exc_history . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_le_global_pref . . . Unloading table nh_elem_type_enum . . .
Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . . Unloading table nh_system_log . . . Unloading table nh_step . . . Unloading table nh_schema_version . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_protocol . . . Unloading table nh_protocol_type . . . Unloading table nh_rpt_config . . . Unloading table nh_rlp_plan . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_analysis . . . Unloading table nh_subject . . . Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . Unloading table nh_var_units . . . Unloading the sample data . . . Unloading the latest sample data definition info . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_bsln_info . . . Unload the dac tables. . . Unloading table nh_daily_exceptions_1000001 .... Unloading table nh_daily_symbol_1000001 .... Unloading table nh_daily_health_1000001 .... Unloading table nh_hourly_health_1000001 .... Unloading table nh_hourly_volume_1000001 .... Unloading table nh_daily_exceptions_1000002 .... Unloading table nh_daily_symbol_1000002 .... Unloading table nh_daily_health_1000002 .... Unloading table nh_hourly_health_1000002 .... Unloading table nh_hourly_volume_1000002 .... Unloading table nh_daily_exceptions_1000003 .... Unloading table nh_daily_symbol_1000003 .... Unloading table nh_daily_health_1000003 .... Unloading table nh_hourly_health_1000003 .... Unloading table nh_hourly_volume_1000003 ....
Unloading table nh_daily_exceptions_1000004 .... Unloading table nh_daily_symbol_1000004 .... Unloading table nh_daily_health_1000004 .... Unloading table nh_hourly_health_1000004 .... Unloading table nh_hourly_volume_1000004 .... Unloading table nh_daily_exceptions_1000005 .... Unloading table nh_daily_symbol_1000005 .... Unloading table nh_daily_health_1000005 .... Unloading table nh_hourly_health_1000005 .... Unloading table nh_hourly_volume_1000005 .... Unloading table nh_daily_exceptions_1000006 .... Unloading table nh_daily_symbol_1000006 .... Unloading table nh_daily_health_1000006 .... Unloading table nh_hourly_health_1000006 .... Unloading table nh_hourly_volume_1000006 .... Unloading table nh_daily_exceptions_1000007 .... Unloading table nh_daily_symbol_1000007 .... Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_symbol_1000007 () INTO 'D:/nethealth/db/save/Daily.tdb/nh_daily_symbol_1000007'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Thu Jul 12 22:25:38 2001) ). (cdb/DuTable::saveTable) -----

8/14/2001 9:21:35 AM mwickham -----Original Message----- From: Wickham, Mark Sent: Tuesday, August 14, 2001 09:07 AM To: Lemmon, Jim Subject: Problem Ticket 16044 (Call Ticket 51075) Jim, Could you provide a status on the subject problem ticket, "DbSave fails randomly: nhiSaveDb.exe: Fatal Internal Error: Unable to execute 'COPY"? Thanks - Mark

8/15/2001 7:38:25 AM mwickham -----Original Message----- From: Lemmon, Jim Sent: Tuesday, August 14, 2001 10:54 PM To: Zhang, Yulun Cc: Wickham, Mark; Trei, Robin Subject: RE: Problem Ticket 16044 (Call Ticket 51075) Yulun, Please evaluate this ticket. If it becomes a time-sink, please let Robin know it will impact Beta 4.
/Jim

8/15/2001 9:37:43 AM yzhang Mark, Have the customer do the following, and send nh_daily_symbol_1000007.out and iivdb.log: echo "help table nh_daily_symbol_1000007\g" | sql $NH_RDBMS_NAME > nh_daily_symbol_1000007.out verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_daily_symbol_1000007 If you cannot get this information by noon, load the customer's database (make sure to use the same version of Nethealth as the customer), and let me know after you finish the load. Thanks, Yulun

8/29/2001 2:06:35 PM lemmon Still no info. Next step is for Yulun to ask support again.

9/14/2001 7:44:55 AM tbailey Note that the only survivor from this company is the CEO. We are unable to get additional information at this time.

9/18/2001 5:56:56 PM yzhang In this case we should close the ticket, and let them know that they can open a new one whenever they have a problem.

7/9/2001 4:00:02 PM wburke Dbsave after force consistent. > >The DB save terminated abnormally: > >Fatal Internal Error: Unable to execute 'COPY TABLE nh_run_schedule () INTO >'/nh/nethealth/db/save/support.tdb'' (E_CO0029 COPY: Copy terminated >abnormally). 0 rows successfully copied. > >I'll be in a meeting from 2-3 so please call me afterwards. >

7/9/2001 5:49:18 PM yzhang Send them ~yzhang/scripts/16098.sh; just run the script by typing the name. The script has been tested.

7/10/2001 8:57:53 AM yzhang Drop nh_run_step manually; if that does not work, drop it with verifydb, then I will write you a script to create the table.

7/10/2001 12:13:16 PM yzhang Get the script from ~/yzhang/scripts/create_run_step_16098.sh. It has been tested; just type the script name to run it.

7/10/2001 4:13:52 PM wburke -----Original Message----- From: Jason Zawacki [mailto:jzawacki@appliedtheory.com] Sent: Tuesday, July 10, 2001 1:37 PM To: support@concord.com Subject: Ticket #51661 Please close this ticket. Thank you Walter and Steve for your help!
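The two diagnostics Yulun requests above (a catalog lookup plus a verifydb structural check of the same table) can be bundled into one helper for support to hand out. This is a minimal sketch: it assumes the Ingres `sql` and `verifydb` CLIs shown in the ticket are on PATH, and the `collect_table_diag` function name and `DRY_RUN` guard are illustrative, not product tools. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: bundle the two table diagnostics requested above.
# Assumes the Ingres "sql" and "verifydb" CLIs from the ticket;
# collect_table_diag and DRY_RUN are illustrative names, not product tools.
NH_RDBMS_NAME=${NH_RDBMS_NAME:-nethealth}

collect_table_diag() {
    tbl=$1
    # Catalog info for the table, captured to <table>.out
    cmd1="echo \"help table ${tbl}\\g\" | sql ${NH_RDBMS_NAME} > ${tbl}.out"
    # Structural check of the same table
    cmd2="verifydb -mreport -sdbname ${NH_RDBMS_NAME} -otable ${tbl}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: just show what would run
        printf '%s\n%s\n' "$cmd1" "$cmd2"
    else
        eval "$cmd1"
        $cmd2
    fi
}

out=$(collect_table_diag nh_daily_symbol_1000007)
printf '%s\n' "$out"
```

Running it with DRY_RUN=0 on the eHealth host would actually execute both commands, leaving nh_daily_symbol_1000007.out behind for the ticket.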
--- Jason Zawacki UNIX Engineer AppliedTheory Corporation Network Engineering Unit (315) 453-2912 x5881 Reloaded Ingres installed from saved db.

7/10/2001 4:14:15 PM wburke x

7/23/2001 10:43:59 AM yzhang Problem solved.

7/9/2001 5:24:49 PM smoran Minor annoyance - the nhDbStatus command does not accept redirection of the output to a file: $ nhDbStatus > /tmp/nhDbStatus.log Database Name: nethealth Database Size: 1804869632.00 bytes RDBMS Version: OI 2.0/9712 (su4.us5/00)

Hello Al: This is in response to call ticket # 51652 concerning "nhDbStatus does not accept redirection of output". The problem is that the output goes to standard error and not standard out. Use these: On UNIX - sh: nhDbStatus > /tmp/dbStat.txt 2>&1 - csh: nhDbStatus >& /tmp/dbStat.txt On NT - nhDbStatus > /tmp/dbStat.txt 2>&1 Please let me know if this answers your question. Thank you, Steven Moran

-----Original Message----- From: Sorrell, Al [mailto:Al_Sorrell@troweprice.com] Sent: Monday, July 09, 2001 4:57 PM To: 'Moran, Steven' Subject: RE: 51652 concerning "nhDbStatus does not accept redirection of output". Steve, Um, yes, that is a workaround, but certainly not the way it should work (i.e., a bug). Al

9/1/2001 3:19:19 AM AR_ESCALATOR Administrative change. This ticket has been created as an Enhancement Request.

7/12/2001 10:50:51 AM cestep Customer is running: Nethealth 4.8 - P3 HP-UX 11.00 Problem: Customer gets "No DBMS servers" on the Nethealth console. Database is stopping intermittently, after receiving memory errors. From the errlog.log: (hpb.us5/00) Server -- Normal Startup. ::[II_RCP , 00000000]: Thu Jul 12 07:46:56 2001 E_SC0204_MEMORY_ALLOC Error allocating memory. ::[II_RCP , 00000000]: Thu Jul 12 07:46:56 2001 E_CL2504_CS_BAD_PARAMETER Invalid parameter on CS call.
Stack dmp name 55365 pid 8470 session 0: 6A16E1D0: scd_note(00000078,00000001,00000001,00000001) Stack dmp name 55365 pid 8470 session 0: 6A16E1D0: scs_initiate(00000078,00000001,00000001,00000001) Stack dmp name 55365 pid 8470 session 0: 6A16E150: scs_sequencer(00000007,00000009,402381A0,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16DB10: scs_sequencer(402381A0,00000000,00000000,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C350: CSMT_setup(00000001,402381A0,402382E4,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C090: C0DD9D58(00000000,00000000,00000000,00000000) ::[55365, ]: Thu Jul 12 07:46:56 2001 Bus Error (SIGBUS) scd_note(0x684f0) @ PC = 685f0, SP = 6a16e1d0, PSW = 4001f ::[II_RCP , 00000000]: Thu Jul 12 07:46:56 2001 E_CL2514_CS_ESCAPED_EXCEPTION An exception has escaped from user session. This session will be terminated. Stack dmp name 55365 pid 8470 session 0: 6A16DB90: IICSMTintr_ack(000001BC,00000001,00000000,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16DB90: IICSintr_ack(000001BC,00000001,00000000,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16DB50: scs_sequencer(00039C73,00039C5C,00039C2B,00039C21) Stack dmp name 55365 pid 8470 session 0: 6A16DB10: scs_sequencer(0003983A,00039834,0003981C,00039803) Stack dmp name 55365 pid 8470 session 0: 6A16C350: CSMT_setup(00000004,402381A0,402382E4,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C090: C0DD9D58(00000000,00000000,00000000,00000000) ::[55365, ]: Thu Jul 12 07:46:56 2001 Bus Error (SIGBUS) IICSMTintr_ack(0x4af18c) @ PC = 4af1c0, SP = 6a16db90, PSW = 4001f ::[II_RCP , 00000000]: Thu Jul 12 07:46:56 2001 E_CL25FF_CS_FATAL_ERROR The server has encountered a FATAL error. The server will be terminated. 
Stack dmp name 55365 pid 8470 session 0: 6A16C690: C01810DC(00000002,40592008,4059488D,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C690: C0183900(00000002,40592008,4059488D,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C590: C0183790(400F5EA0,00000008,00000000,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C490: C02056D4(00000000,00000000,00000000,00000001) Stack dmp name 55365 pid 8470 session 0: 6A16C450: C0DDA280(6A1A14B8,00000000,6A16C000,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C3D0: C0DDA118(400D4994,00000000,00000000,00000000) Stack dmp name 55365 pid 8470 session 0: 6A16C350: CSMT_setup(400D498C,00000000,00000000,00000400) Stack dmp name 55365 pid 8470 session 0: 6A16C090: C0DD9D58(00000000,00000000,00000000,00000000) ::[55365, ]: Thu Jul 12 07:46:56 2001 Bus Error (SIGBUS) @ PC = c01810dc, SP = 6a16c690, PSW = 4ff1f ::[II_RCP , 00000000]: Thu Jul 12 07:46:56 2001 E_CL25FF_CS_FATAL_ERROR The server has encountered a FATAL error. The server will be terminated.
------------------------------------------------------------------
Steps already taken: - Looked for a core dump; could not find one. - Tried to unlimit the stack size, recreated the transaction log and resized it. - After that, Nethealth and Ingres ran for about 24 hours before repeating the same errors. - Told the customer that the problem may be with their system. - He had the Unix admins look, but they did not find anything wrong. - Received the /var/adm/syslog/syslog.log, which does not show any system problems. All files are on BAFS, under ticket #51461.

7/12/2001 11:10:58 AM yzhang Asked support to send the new nhReset script, then check the Ingres processes before running Nethealth.

8/16/2001 10:31:59 AM yzhang Support has been working with the customer recently.

8/23/2001 11:01:08 AM mmcnally This problem is resolved.
We reinstalled NH and the database.

7/16/2001 2:56:28 PM foconnor Upon upgrading to 4.8 on a distributed polling site, the customer uninstalls Network Health on the remotes and then installs 4.8. However, the customer did not save the databases on the remotes, but did save the poller.cfg files, thinking that was all they needed to do. Because they did not save the databases and load them after upgrading the OS and Network Health, the element_ids on the remotes are not the same as for the same elements on the central, and as a result the data received from the remotes is being inserted into the wrong elements. I would like to get a procedure that will restore the proper name and element_id associations that are in the central database to the remote databases. The customer is effectively down since the newly polled data does not align with the historic data. One procedure suggested (by Art Hamlin, if I understood him correctly) was to build a temp table called nh_centralsite_element, which would have the names and element_ids from the central (correct) database, and create these tables on the remotes.
-- Perform one last remote save on the remotes and stop polling.
-- Copy the nh_element table into an nh_element_corrupt table on both of the remotes.
-- Drop the indices on the nh_element tables on the remotes.
-- Modify the nh_element tables (on the remotes) using the information from the temp table nh_centralsite_element.
Can you create a step-by-step procedure to make the name/element_id on the remotes match the central?
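The remap steps suggested above can be drafted as SQL before anyone touches a remote. A minimal sketch that only generates the statements for review: nh_centralsite_element and nh_element_corrupt are the table names from the ticket, but the Ingres UPDATE ... FROM join syntax is an assumption here and should be verified against the installed Ingres release before running anything.

```shell
#!/bin/sh
# Sketch only: write out (do not run) the remap SQL described above.
# Table names come from the ticket; the UPDATE ... FROM join syntax is an
# assumption -- verify it against the Ingres release in use before running.
cat > remap_elements.sql <<'EOF'
/* 1. Keep a copy of the mismatched remote table for safety. */
CREATE TABLE nh_element_corrupt AS SELECT * FROM nh_element;

/* 2. Re-point each remote element_id at the central site's value,
      matching rows by element name via the imported temp table. */
UPDATE nh_element e FROM nh_centralsite_element c
SET element_id = c.element_id
WHERE e.name = c.name;
EOF
echo "Wrote remap_elements.sql; review it, then run: sql nethealth < remap_elements.sql"
```

Generating the file rather than piping SQL straight into `sql` keeps a reviewable artifact, which matters when the change rewrites primary identifiers on a production remote.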
7/17/2001 1:38:17 PM yzhang Farrell, Make sure the poller.cfg is located in $NH_HOME/poller, then follow the steps here: 1) Find out the element_id range in the central site. 2) Change Server_ID in the nethealthrc.sh file to match the element_id in the central site (follow the steps for changing Server_ID from the Nethealth guide). 3) Destroy the database. 4) Create the database. 5) Reboot the machine. Let me know if you have questions. Yulun

7/17/2001 3:19:10 PM foconnor Spoke with Yulun; further work needs to be done.

7/17/2001 6:18:46 PM yzhang Will write a script for this problem tomorrow.

7/18/2001 8:31:29 AM foconnor Customer said to close this issue; they are not interested in fixing their problem.

7/18/2001 8:33:16 AM yzhang Can you collect the following: From the remote site (the remote site which has the problem): 1) the Server_id 2) echo "select element_id, name from nh_element\g" | sql nethealth > remote_elem.out From the central: 1) echo "select element_id, name from nh_element\g" | sql nethealth > central_elem.out I am going to write you a script with this information. Thanks, Yulun

7/18/2001 8:37:35 AM yzhang This ticket was closed because the customer wants to set up a new central site.

7/19/2001 2:50:09 PM mpoller Issue: Output of nhIndexDiag shows tables as HEAP when they should have been BTREE. - Initially this issue was multiple DB issues. - The trans_log was too small. - Rollups failed. - Lots of error messages in errlog.log. - Enlarged trans_log. - Rollups stopped failing and error messages in errlog stopped appearing. - Customer also noticed the following message from the nhIndexDiag output: Problem encountered with analyzing table nh_daily_exceptions Error: Indexing problem: nh_daily_exceptions should have been btree but was HEAP. - This message appears five times in the nhIndexDiag output, each time for a different table. - Spoke with Bob concerning this. He checked his DB and saw that his tables are all BTREE.
He thought we may have changed the table type structure from 4.7 to 4.8 and failed to update nhIndexDiag accordingly. He emailed Robin Trei, who said "I doubt it. I think it is more likely that the customer ran out of db work space and did not have the room to index the tables." This could very well have been resolved by raising the size of the trans_log. - Output of nhIndexDiag looks clean, but the below messages still appear: Problem encountered with analyzing table nh_daily_exceptions Error: Indexing problem: nh_daily_exceptions should have been btree but was HEAP. Problem encountered with analyzing table nh_daily_health Error: Indexing problem: nh_daily_health should have been btree but was HEAP. Problem encountered with analyzing table nh_daily_symbol Error: Indexing problem: nh_daily_symbol should have been btree but was HEAP. Problem encountered with analyzing table nh_hourly_health Error: Indexing problem: nh_hourly_health should have been btree but was HEAP. Problem encountered with analyzing table nh_hourly_volume Error: Indexing problem: nh_hourly_volume should have been btree but was HEAP. - Output from df -k against the ingres partition/disk: /dev/md/dsk/d6 21432672 14173642 7044704 67% /nethdb

7/26/2001 4:34:57 PM mwickham Customer ran the convertDac and nhIndexDiag scripts. Output files from both are located on BAFS in \escalated tickets\51000\51711\26Jul01.

8/8/2001 5:33:08 PM lemmon Assigned to Yulun Zhang for evaluation.

8/8/2001 5:57:38 PM yzhang Has told support that the error can be ignored, because the error (indexing a plain dac table in 4.8) doesn't affect operation. There is a ticket for fixing this problem in nh50-beta5. Yulun

11/20/2001 11:49:31 AM yzhang Problem solved.

7/24/2001 10:40:54 AM cestep Environment: The customer has 5 remote pollers and one central server. They are all: Network Health 4.8 p4 d4 Solaris 2.7 Problem: Three of the servers say "Server stopped unexpectedly" at intermittent times.
The three host names: fft lxe njy "njy" appears to stop unexpectedly every 3 hours and 16 minutes. So far, we do not see any other process that correlates with this interval. Received the following from the customer for each server: - Advanced logging on the Message server - System log - Errlog.log All files are on BAFS, under ticket #51999.

7/24/2001 12:07:32 PM bhinkel .

7/24/2001 12:11:11 PM rnaik Can you please get the following: -- system specifications: #elements, #nodes, #probes, memory, SWAP space. I think customer support makes the customers run some utilities to obtain system specifications and also a file called dbStatus.out, which has the dbSize and number of elements etc. -- advanced log for DbServer - Before turning on advanced logging for dbServer, can you modify their $NH_HOME/sys/debugLog.cfg to have the following arguments for nhiDbServer: program nhiDbServer { arguments "-Dm cu:dsvr:ccm:tb -Df cCditzZ -Dt" So: stop servers, modify $NH_HOME/sys/debugLog.cfg, start servers, turn advanced logging on for nhiDbServer (i.e., database). Please give me a call if you have questions (4445). - Like you said, let's get all the info while a Concord person is on site. Thanks Colin, R.

-----Original Message----- From: Estep, Colin Sent: Tuesday, July 24, 2001 11:15 AM To: Naik, Rupa Subject: Problem ticket #16460 Hi Rupa, I have the call ticket associated with this issue. We have someone from Concord on site today. So, I was wondering what other debug you think we might need. If I could get everything you need today, it will be easier than trying to get the customer to do it later. I appreciate any insight on this one. Thanks, Colin Estep, Senior Support Engineer

7/24/2001 1:25:40 PM cestep -----Original Message----- From: Piergallini, Anthony Sent: Tuesday, July 24, 2001 1:11 PM To: Estep, Colin Cc: Naik, Rupa Subject: RE: Problem ticket #16460 Colin/Rupa, Thanks for the quick response. Here is the start of the information you requested.
The debug will take some time to collect. I will get the customer to send the debug once it has been collected if I am not here. Is there anything else that we need? Tony

General Information: -------------------- The customer has a distributed polling system that consists of five remote machines and one central machine. The five remote machines are set up in different geographical locations but are all managed by the person here in Virginia. All of the machines are running eHealth 4.8 P04/D04. P04 and D04 were installed on 7/5/01. This problem was noticed by the customer on 7/16/01. I don't know if the problem is related to the patch/cert but I 'feel' that it isn't. Three of the machines (LA, NJ and Frankfurt, Germany) are experiencing unexpected stops.

Machine information (LA): ------------------------- Hostname - crd.lxe HostID - 80f9aa16 RAM - 512 MB Swap - 700 MB # Elements - 3740 # Probes - 0 # Nodes - 0 DB Size - 1.203 GB This system experiences the server stop/restart every 3 hours and 15 minutes. The console has been up and running since 7/18 and this pattern repeats from that date until today.

Machine information (NJ): ------------------------- Hostname - crd.njy HostID - 80f9b396 RAM - 512 MB Swap - 700 MB # Elements - 3565 # Probes - 0 # Nodes - 0 DB Size - 1.472 GB This system experiences the server stop/restart every 3 hours and 16 minutes. The console has been up and running since 7/19 and this pattern repeats from that date until today.

Machine information (Frankfurt, Germany): ----------------------------------------- Hostname - crd.fft HostID - 80f9ad5a RAM - 512 MB Swap - 700 MB # Elements - 1966 # Probes - 0 # Nodes - 0 DB Size - 723.755 MB This system experiences the server stop/restart every 3 hours and 42 minutes. The console has been up and running since 7/16 and this pattern repeats from that date until today.

7/25/2001 8:43:30 AM cestep Received advanced logging for the three remote sites. All files are on BAFS, under ticket #51999. Changing back to assigned.
7/25/2001 8:43:40 AM cestep -

7/25/2001 2:16:12 PM tstachowicz Have sent to Rupa: - london (working) messages.stats.log - london (working) system messages - new jersey (failing) system messages - output from de-bugging of dbServer: nhiDbServer.txt.fft nhiDbServer.txt.lxe nhiDbServer.txt.njy - number of elements on each machine: lxe (LA) = 2,700 elements; njy (NJ) = 3,500 elements; lhx (London) = 1,220 elements; tdo (Tokyo) = 323 elements; fft (Frankfurt) = 2,000 elements - system specs of all remote pollers: Netra T1s, 512 MB RAM, 15 GB disk drive, all running Solaris 2.7; CPU is UltraSPARC-IIi.

7/25/2001 3:26:15 PM rnaik Tania, can you have them reproduce the problem after turning on advanced logging for the statistics poller, config server and message server, and also get the corresponding system log. I am assigning this to Dave Shepard. Thanks - R.

7/25/2001 4:05:41 PM tstachowicz They are not polling BMC elements. It is stopping every 3 hours and 16 minutes. They will send in the advanced logging tonight at 6:00. The fetches happen every 4 hours. There is no correlation as to why the servers are stopping and restarting every 3 hours and 16 minutes.

7/31/2001 10:28:36 AM tstachowicz Advanced logging on BAFS.

7/31/2001 11:39:56 AM dshepard Checking logs.

7/31/2001 5:30:53 PM dshepard The logs show nothing strange. It doesn't appear that the problem is in the CfgServer. I think we can rule that out. There is not enough info in the poller logs. We didn't get the message server logs that we requested. I suggest turning on more debug flags and getting a better poller log. If they know it happens every 3 hours and 16 minutes, have them do the following: Stop the console. Edit the $NH_HOME/sys/debugLog.cfg file as follows: Change: program nhiPoller[Net] { arguments "-Dm poller -Df dtp -Dt" } To: program nhiPoller[Net] { arguments "-Dm poller -Dfall" } Then stop and restart the Console (not the servers).
10 minutes before they expect the problem to happen, have them turn on advanced logging for the Stats Poller and send in the resulting advanced log file in $NH_HOME/log/advanced/nhiPoller_Net.txt Hopefully that will allow us to peg it as a poller problem or rule it out.

8/1/2001 9:43:21 AM tstachowicz Dave told me on July 25th that we do not need the message server in advanced logging. Do we still need it? Will we need the system logs for the corresponding time?

8/1/2001 10:28:21 AM tstachowicz Sent email to the customer requesting the above information (not message server advanced logging).

8/2/2001 11:18:49 AM tstachowicz Customer has collected de-bugging from one of the sites. Placed on BAFS/escalated tickets/51000/51999.

8/2/2001 11:19:55 AM tstachowicz x

8/2/2001 5:27:35 PM dshepard Spent time on another escalated ticket today. Will get to this tomorrow.

8/3/2001 3:50:28 PM dshepard The advanced log they sent us showed something different from the first three we got from them. The first three they sent us showed it hung while processing an SNMP response with a NOSUCHNAME error. This new one showed them waiting for the database to load all the polled data. How long was it supposedly hung? Perhaps they killed a perfectly fine poll. Needless to say, it didn't provide the necessary information. We should have them turn on the -Dt flag in the configuration file for advanced logging as well. That will give us an idea of how long the poller spent waiting for the database. Also have them dump the poller checkpoints again when it happens. That way we can correlate them with the advanced log file.

8/6/2001 8:40:19 AM tstachowicz The customer sent in the fft server log that was stopping and starting. Dave - can you look at this one before I tell them to put the -Dt flag in the cfg file? It is up on BAFS for 8/3/01.

8/7/2001 9:49:55 AM dshepard Is this thing stopping and starting, or hung? The two are very different things, and the difference is important.
Given that the last two problems shown in the log files were database-related hangs (as far as I can tell), I will transfer the ticket to Yulun once you confirm for me that it was hung and did not crash. Robin has said that she wants IPM run on the machine after the DB data load is attempted at the end of the poll.

8/7/2001 9:55:27 AM dshepard I must be getting confused between a couple of different tickets. If it is crashing, then I'll need more debug. I had been working off the assumption that it was hanging. But as I said, the last two logs we got from them showed something totally different from the first set. The first set showed that the last thing it did was process an SNMP NOSUCHNAME error. The second set showed that it was loading data into the database. I guess we need to turn on more debugging and try to get something consistent. Change the debug flags as follows for the stats poller: -Dall -Dt Also have them turn on SNMP advanced logging at the same time. Perhaps that will be enough to recreate it here (on paper at least).

8/7/2001 12:08:30 PM tstachowicz Requested from customer: The de-bugging that we have received from the different pollers has given us conflicting causes. We will need more de-bugging to verify what this truly is (the poller or the database). 1.) Stop the console. Edit the $NH_HOME/sys/debugLog.cfg file as follows: Change: program nhiPoller[Net] { arguments "-Dm poller -Dfall" } To: program nhiPoller[Net] { arguments "-Dall -Dt" } Then stop and restart the Console (not the servers). 2.)
10 minutes before you expect the problem to happen, please turn on advanced logging for the Stats Poller and SNMP, and send in the resulting advanced log file in $NH_HOME/log/advanced/nhiPoller_Net.txt

8/14/2001 9:54:44 AM schapman -----Original Message----- From: Chapman, Sheldon Sent: Tuesday, August 14, 2001 9:40 AM To: Shepard, Dave Subject: Problem Ticket 16460 Call Ticket 51999 Dave, The debug that you requested for this issue has been put on BAFS in the subfolder 8-10-01 for this issue 51999. Let me know if you need anything else. Thanks, Sheldon

8/14/2001 11:30:04 AM dshepard The new debug flags I requested were not enabled. As a result, this set of data doesn't tell me anything further than the last set. Back to MoreInfo.

8/14/2001 11:54:11 AM schapman .

8/21/2001 4:33:01 PM don Info sent was not the info requested. Re-requested the info.

8/27/2001 3:05:04 PM jnormandin The requested debug has been received and placed on \\Bafs\escalated tickets\51000\51999\August27 - The Poller debug and SNMP debug has been retrieved for all 3 systems experiencing the issue: crd.lxe crd.njy crd.fft - According to the customer, the debug has been generated utilizing the flags specified. - Changed status back to assigned.

8/28/2001 4:40:39 PM dshepard Once again they have not enabled the correct debug flags. Please get the debugLog.cfg file from them and make sure they have the flags set correctly and are stopping and starting the console window to enable them. We are not getting timestamps on the messages from the -Dt argument. Also have them delete the pre-existing advanced log files. They are additive and getting larger than need be. Once again this shows the Poller exiting while waiting for the database to load the data (CdbTblsStats::loadSamples). Since all of the remote machines are seeing the same thing, they can limit their work to just one of them. We need to get a system log to coincide with the poller log.
That, coupled with the timestamps in the poller advanced log file, should show us whether it was hung for a while and how it correlates with other stuff in the system log. No need to collect SNMP advanced logs anymore. It is obviously not related to polling data given the location of the failures. Change the poller advanced log parameters to "-Dall -Dt". Perhaps that will catch problems from a Database layer failure.

8/30/2001 3:52:49 PM jnormandin - I have obtained the nhiPoller debug created using the -Dall -Dt debug flags as requested. - I verified that the debug does now contain the time stamps. - Also collected the system messages as well as all of the ingres error logs. - Changing status back to assigned.

9/12/2001 6:32:43 PM dshepard The debug file they sent apparently showed a problem writing the latest poll timestamps to the database. Basically it was trying to load data from a binary file into the nh_stats_poll_info database table. It was terminated, possibly by another server process crashing. Here are the last few lines of the debug log:
08/30/01 12:10:36 [d,du ] Begin transaction level 1
08/30/01 12:10:36 [d,du ] Executing SQL cmd 'MODIFY nh_stats_poll_info TO TRUNCATED' ...
08/30/01 12:10:36 [d,du ] DuDatabase (execSql): errorOnNoRows: No
08/30/01 12:10:36 [Z,du ] (dbExecSql): errorOnNoRows: No
08/30/01 12:10:36 [Z,du ] (dbExecSql): sqlCmd: MODIFY nh_stats_poll_info TO TRUNCATED
08/30/01 12:10:37 [Z,du ] (dbExecSql): sqlca.sqlcode: 100
08/30/01 12:10:37 [Z,du ] (dbExecSql): rows: 0
08/30/01 12:10:37 [Z,du ] sqlca.sqlcode: 100
08/30/01 12:10:37 [Z,du ] rows: 0
08/30/01 12:10:37 [d,du ] Cmd complete, SQL code = 100
08/30/01 12:10:37 [d,du ] Committing database transaction ...
08/30/01 12:10:37 [d,du ] Committed.
08/30/01 12:10:37 [d,du ] End transaction level 1
08/30/01 12:10:37 [d,cdb ] (Stats) Poll Info deleted.
08/30/01 12:10:37 [i,cu ] Closing file = '/export/home/nethealth/tmp/STSample_999187199_14126'
08/30/01 12:10:37 [i,cu ] Close complete, status = Yes
08/30/01 12:10:37 [z,du ] Appending file /export/home/nethealth/tmp/StatsPI_999187831_999190799_14126 to table nh_stats_poll_info ...
08/30/01 12:10:37 [d,du ] Begin transaction level 1
08/30/01 12:10:37 [d,du ] Executing SQL cmd 'COPY TABLE nh_stats_poll_info () FROM '/export/home/nethealth/tmp/StatsPI_999187831_999190799_14126'' ...
08/30/01 12:10:37 [d,du ] DuDatabase (execSql): errorOnNoRows: No
08/30/01 12:10:37 [Z,du ] (dbExecSql): errorOnNoRows: No
08/30/01 12:10:37 [Z,du ] (dbExecSql): sqlCmd: COPY TABLE nh_stats_poll_info () FROM '/export/home/nethealth/tmp/StatsPI_999187831_999190799_14126'
08/30/01 12:10:37 [T,poller] E:PlrPingApi::~PlrPingApi
08/30/01 12:10:37 [T,poller] X:PlrPingApi::~PlrPingApi

They should check for the following file hanging around: /export/home/nethealth/tmp/StatsPI_999187831_999190799_14126 If they find it, send it to us. If things crash again, have them look for files that are named /export/home/nethealth/tmp/StatsPI_* and send them. I am sending this to the DB group now as that appears to be the area of difficulty.

9/13/2001 12:25:05 PM rtrei Yulun -- I will need to work with you on this. The first step is to see what the customer's poll_info table looks like -- see if we can get a copy of that. If they do have the file they were trying to load, please give them a script that loads it manually, so that we can see if any errors are being swallowed.

9/14/2001 12:39:01 PM yzhang Jason, Can you collect the following from the customer: echo "copy table nh_stats_poll_info() into 'nh_stats_poll_info.dat'\g" | sql $NH_RDBMS_NAME echo "copy table nh_import_poll_info() into 'nh_import_poll_info'\g" | sql $NH_RDBMS_NAME If they already have the files I am asking for here, let me know so that I can give them a script to load them.

9/14/2001 3:51:34 PM yzhang Jason, any luck in talking to the customer?
If they still have /export/home/nethealth/tmp/StatsPI_999187831_999190799_14126, then they can run the following: echo "COPY TABLE nh_stats_poll_info () FROM '/export/home/nethealth/tmp/StatsPI_999187831_999190799_14126'\g" | sql $NH_RDBMS_NAME Send us the output of this query.

9/17/2001 11:03:39 AM yzhang Nothing in the attached file is useful. Have the customer tar $NH_HOME/tmp and send it to us as soon as possible. Also, this is problem 16460, not 16376. Thanks, Yulun

9/17/2001 4:13:57 PM jnormandin - All StatsPI files and output of the sql commands placed on BAFS for 51000\51999\September17 - Status back to assigned.

9/18/2001 3:44:49 PM yzhang Requested to run nhCollectCustData from the three remote sites which have the problem of the server stopping unexpectedly every 3 hours.

9/19/2001 1:51:21 PM don Changed to MoreInfo.

9/19/2001 3:52:04 PM jnormandin The requested files have been placed on BAFS... 51000\51999\September19 - Status changed back to assigned.

9/20/2001 1:45:57 PM yzhang Maintenance jobs on three of the remotes failed with an Ingres error. Had the customer run the maintenance for the three systems at 11:08am today, and all of them succeeded. Told the customer to watch closely for the next cycle at about 2:08PM today. Also, the customer keeps 4 weeks of stats0 tables, and 0 for stats1 and stats2 tables, but I noticed that the oldest stats0 table is from Aug/2000. They set up the system in March, so there should be no table with that old a timestamp. Asked them to find out when the table was created and what the row counts in the tables are.

9/20/2001 1:53:53 PM jnormandin Yulun, will the customer be providing me with that information? Please keep me in the loop so that all tickets are updated correctly.

9/20/2001 4:18:52 PM yzhang This customer's problem is that on three of the remote systems, the nhServer stops and restarts about every three hours.
I had them run nhReset at 11:00 AM today on each of the three remotes; the nhServer on each of them then stopped and restarted between 2:00 and 3:00. There are no entries in errlog.log for this interruption. This customer is going to place a drive with more disk space in each of the remote systems, install nethealth there, and then load the current database. She suggested that we can close this problem (since it has existed for more than two months without being solved), and possibly the other problems currently open with this customer. But we need to help them install nethealth with the current NH_SERVER_ID and do the database save and load, to make sure the transfer is smooth. They also mentioned that the server interruptions started after upgrading to 48 patch 4. I don't know if we still want them on patch 4. I think this ticket can be de-escalated, or closed. Yulun

9/27/2001 12:02:54 PM yzhang They are transferring nethealth to the new disk with more space; we are still waiting to see whether everything works after the transfer. Can you check with the customer to see how the transfer is going? Thanks, Yulun

9/28/2001 11:28:55 AM jnormandin Customer is still in the process of updating the disks.

11/28/2001 8:53:32 AM jnormandin Customer has one more disk to replace (should be done this week). They do not want to close the ticket until it is confirmed that this server is also operating normally.

12/3/2001 11:41:08 AM yzhang Find out how this customer is doing

12/3/2001 11:47:53 AM yzhang problem solved

7/26/2001 3:09:57 PM tstachowicz ***NOTE: This is the only database save that this customer has loaded. The saves are not complete in the .tdb directory (no lanwan directory, router directory, poller.cfg, etc.). Robin requested that I get a verifydb on the table nh_daily_exceptions1000001, but the table is missing from help\g, so verifydb failed.
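When help\g and verifydb disagree about a table, the Ingres system catalog can arbitrate; support uses exactly this iitables query later in the ticket. A hedged sketch that only generates the catalog query text — the `table_check_sql` wrapper name is mine:

```shell
# table_check_sql: emit a query against the iitables catalog view to confirm
# whether a table exists (and its row count) before pointing verifydb at it.
table_check_sql() {
    printf "select table_name, num_rows from iitables where table_name = '%s'\\\\g\n" "$1"
}

table_check_sql nh_daily_exceptions_1000001
# Piped into `sql $NH_RDBMS_NAME`, an empty result explains the E_US0845
# "does not exist or is not owned by you" failures seen in the save log.
```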
----------------------------------------
Robin has requested:
- install log
- list of tables
- echo "copy table nh_rpt_config\g" | sql nethealth > rpt_config.dat
(I will be copying to BAFS once I receive all this)
***************************************
FACTS:

DATABASE SAVE:
Unloading table nh_var_units . . . Unloading the sample data . . . Unloading the latest sample data definition info . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_bsln_info . . . Unload the dac tables. . . Unloading table nh_daily_exceptions_1000001 ....
Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions_1000001 () INTO '/nh/dbfull.tdb/nh_daily_exceptions_1000001'' (E_US0845 Table 'nh_daily_exceptions_1000001' does not exist or is not owned by you. (Thu Jul 26 04:53:22 2001) ). (cdb/DuTable::saveTable)

DATA ANALYSIS:
$NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME
----- Begin processing (07/26/2001 01:40:20 AM).
Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Error: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Error: Unexpected database error.

LOAD.LOG:
Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 0 max_range: 995227199.
Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 0 max_range: 995230799.
Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 0 max_range: 995234399.
Table nh_rlp_boundary inconsistent, deleting row: type: ST stage: 0 max_range: 995237999.
Loading the Dac tables . . .
Fatal Internal Error: Uncompress of file /nh1/db.tdb/nh_daily_exceptions_1000001 failed.
(cdb/DuTable::loadTable)

7/26/2001 3:56:59 PM tstachowicz Collected help\g on BAFS/escalated tickets/52000/52024/07-26-01/help.txt. Collected install log on BAFS/escalated tickets/52000/52024/07-26-01/load07-20-01.log

7/27/2001 9:03:17 AM tstachowicz Output of copying the nh_rpt_config table put on BAFS/escalated tickets/52000/52024/07-26-01/rpt_config_dat

7/27/2001 3:54:23 PM tstachowicz Customer ran (per Robin):
echo "create table nh_daily_exceptions_100001 as select * from nh_daily_exceptions; commit\g" | sql nethealth
echo "create table nh_daily_health_100001 as select * from nh_daily_health; commit\g" | sql nethealth
echo "create table nh_daily_symbol_100001 as select * from nh_daily_symbol; commit\g" | sql nethealth
echo "create table nh_hourly_health_100001 as select * from nh_hourly_health; commit\g" | sql nethealth
echo "create table nh_hourly_volume_100001 as select * from nh_hourly_volume; commit\g" | sql nethealth
He will have his scheduled save and DA run tonight and will let me know the outcome.

7/30/2001 9:05:46 AM tstachowicz DATABASE SAVE FAILED AGAIN:
Job started by Scheduler at '07/29/2001 03:30:12 AM'. Unload the dac tables. . . Unloading table nh_daily_exceptions_1000001 .... Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions_1000001 () INTO '/nh/dbfull.tdb/nh_daily_exceptions_1000001'' (E_US0845 Table 'nh_daily_exceptions_1000001' does not exist or is not owned by you. (Sun Jul 29 04:53:39 2001) ). (cdb/DuTable::saveTable)
DATA ANALYSIS FAILED AGAIN:
Job started by Scheduler at '07/29/2001 01:46:05 AM'. ----- ----- $NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (07/29/2001 01:46:07 AM). Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Error: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error. Warning: Unexpected database error.
Warning: Unexpected database error. Error: Unexpected database error.

7/30/2001 11:24:22 AM yzhang Support has sent a script to create the five split dac tables; after this the customer will do a dbsave again to see whether the same message appears.

7/30/2001 2:26:02 PM tstachowicz Ran script. Ran save again: Unloading table nh_daily_exceptions_1000002 .... Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions_1000002 () INTO '/nh/dbfull.tdb/nh_daily_exceptions_1000002'' (E_US0845 Table 'nh_daily_exceptions_1000002' does not exist or is not owned by you. (Mon Jul 30 13:55:22 2001) ). (cdb/DuTable::saveTable)

7/30/2001 4:28:13 PM yzhang Run this simple SQL as soon as possible, and send me the table.out; the rpt_config.dat I got from the escalation directory cannot be loaded.
echo "select table_name, num_rows from iitables order by table_name\g" | sql $NH_RDBMS_NAME > table.out
Yulun

7/30/2001 4:53:59 PM yzhang Tom, can you also collect the following (login as nh_user):
1) nhiDataAnalysis -Dall -d $NH_RDBMS_NAME -u $NH_USER >& nhiDataAnalysis
2) nhCollectCustData $NH_RDBMS_NAME
This will create nhCollect.tar in $NH_HOME/tmp; send this tar file. Thanks, Yulun

7/31/2001 5:01:59 PM yzhang OK, the script is now on the Concord FTP site in ~ftp/outgoing; make sure to use ASCII mode when you FTP it.

8/2/2001 10:29:59 AM yzhang Tania, I noticed that the customer can do the dbsave now, because the split dac tables were there. I just want to know whether you had the customer create all of the split dac tables, such as nh_daily_health_1000004, or whether they were created from somewhere else. Thanks, Yulun

8/2/2001 11:45:19 AM yzhang Can you have the customer run ~/yzhang/scripts/16590.sh, and send me 16590.out from $NH_HOME/tmp. If they have problems running the script you emailed, you might want to place the script in the ftp.concord.com/outgoing directory.
The script will delete 5 unused tables and check whether they really have duplicates in the stats tables; the script has been tested. Thanks, Yulun

8/2/2001 3:28:38 PM yzhang The stats tables have not been indexed. Can you do the following: login as nhuser, source nethealthrc.csh, then
cd $NH_HOME/bin/sys
./nhiIndexStats -Dall >& nhiIndexStats.out

8/2/2001 6:22:44 PM yzhang Asked them to run a script to index the stats1 and stats2 tables, then run the stats rollup followed by data analysis.

8/3/2001 1:37:49 PM yzhang Login as nhuser and source nethealthrc.csh, then do the following:
1) echo "create unique index nh_stats0_996854399_ix1 on nh_stats0_996854399 (sample_time, element_id) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100\g" | sql nethealth
echo "create unique index nh_stats0_996854399_ix2 on nh_stats0_996854399 (element_id, sample_time) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100\g" | sql nethealth
2) nhiRollupDb
3) nhiDataAnalysis
Let me know the result. Thanks, Yulun

8/3/2001 5:16:56 PM yzhang The problem is that the split dac tables need to be indexed. Grab the script named index_splited_dac.sh from ~ftp/outgoing on the Concord FTP site, and run it using the following command:
index_splited_dac.sh > index_dac.out
After this, send me:
echo "help\g" | sql nethealth > new_table.out
Thanks, Yulun

8/9/2001 9:55:09 AM tstachowicz All files exist on the file system.

8/9/2001 12:07:14 PM yzhang If all the files exist, what I can say is to check the permissions on the files and directories, and also to see whether they keep getting the same error each time, or it was just a one-time error.

8/15/2001 9:10:48 AM yzhang Run dbsave at advanced debug, and send the debug output so I can locate where the files are copied to. Thanks, Yulun

8/23/2001 10:47:11 AM yzhang problem solved

8/8/2001 11:13:36 AM rkeville Customer has Traffic Accountant and Stats polling on the same system.
nh_element is over 2 GB in size; other tables may be oversize as well.
- Database is loading on marble.
- nh_element has 2096994 rows.
Customer wants to keep all data except for one probe, element_id 3069331.
#############################################################

8/8/2001 11:29:21 AM yzhang Bob is loading the database now; I will take a look after the loading.

8/10/2001 3:00:55 PM yzhang Customer agreed to remove all TA data, so the problem was solved.

8/13/2001 3:33:21 PM rrick Problem:
vibes.mirrorimage.net% ./nhiRollupDb
Begin processing (08/13/2001 09:43:58 AM).
Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Aug 13 09:48:30 2001)
What has been done: I have tried to re-index the DB. We also get a missing index for nh_job_schedule; I wrote a script to re-build that table, and each time the customer reruns the rollup he gets the above error. I cannot seem to locate what is duplicated. All files are located on //BAFS/Escalated Tickets/52000/52355

8/13/2001 5:20:13 PM yzhang Run the following from the command line after you login as nhuser and have sourced nethealthrc.csh; send me the dup.out. Thanks, Yulun
echo "select sample_time, element_id, count (*) from nh_stats0_996778799 group by sample_time, element_id having count (*) > 1;select element_id, sample_time, count (*) from nh_stats0_996778799 group by element_id, sample_time having count (*) > 1\g" | sql $NH_RDBMS_NAME >> dup.out

8/14/2001 2:59:44 PM rrick From: Rick, Russell Sent: Tuesday, August 14, 2001 2:46 PM To: 'larry@mirror-image.com' Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Hi Larry,
Instructions: Please execute the following procedure:
1. Execute the following from the $NH_HOME/bin directory: nhSaveDb -p -u
2. Please go to the Concord Communications FTP site at ftp@concord.com. Login: Userid = anonymous, Password = your email address
3.
Please put the *.tdb directory from the nhSaveDb output into the ftp.concord.com/incoming directory. In the incoming directory, please make a sub-directory called Mirror Image so that I can identify the output. If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm EST.
Regards, Russell K. Rick, Senior Support Engineer, Domestic and International Customers Group, Concord Communications, Inc. http://www.concord.com 600 Nickerson Road Marlboro, Ma. 01752 USA Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300
Technical Support hours are 5:00 AM to 8:00 PM EST, Monday through Friday. Please address all responses to support@concord.com and include your assigned call ticket number in the subject field.

8/14/2001 3:35:33 PM rrick From: Larry Finn [mailto:lfinn@mirror-image.com] Sent: Tuesday, August 14, 2001 3:20 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Russ- Got some funky errors when I ran the script. LF
vibes.mirrorimage.net% pwd
/opt/health
vibes.mirrorimage.net% cd bin
vibes.mirrorimage.net% nhSaveDb -p /opt/health/idb/save -u neth nethealth
See log file /opt/health/log/save.log for details...
vibes.mirrorimage.net% more save.log
Begin processing (08/14/2001 03:20:37 PM).
Copying relevant files (08/14/2001 03:20:38 PM).
Unloading the data into the files, in directory: '/opt/health/idb/save.tdb/'. . .
Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_daily_exceptions . . . Unloading table nh_daily_health . . . Unloading table nh_daily_symbol . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . .
Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_hourly_health . . . Unloading table nh_hourly_volume . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . .
Fatal Internal Error: Unable to execute 'COPY TABLE nh_job_schedule () INTO '/opt/health/idb/save.tdb/nsc_b45" (E_US0845 Table 'nh_job_schedule' does not exist or is not owned by you. (Tue Aug 14 15:20:46 2001) ). (cdb/DuTable::saveTable)
vibes.mirrorimage.net%

-----Original Message----- From: Rick, Russell [mailto:RRick@concord.com] Sent: Tuesday, August 14, 2001 2:46 PM To: 'larry@mirror-image.com' Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Hi Larry,
Instructions: Please execute the following procedure:
1. Execute the following from the $NH_HOME/bin directory: nhSaveDb -p /opt/health/idb/save -u neth nethealth
2. Please go to the Concord Communications FTP site at ftp@concord.com. Login: Userid = anonymous, Password = your email address
3.
Please put the *.tdb directory from the nhSaveDb output into the ftp.concord.com/incoming directory. In the incoming directory, please make a sub-directory called Mirror Image so that I can identify the output. If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm EST. Regards, Russell K. Rick, Senior Support Engineer, Domestic and International Customers Group, Concord Communications, Inc. http://www.concord.com 600 Nickerson Road Marlboro, Ma. 01752 USA Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300 Technical Support hours are 5:00 AM to 8:00 PM EST, Monday through Friday. Please address all responses to support@concord.com and include your assigned call ticket number in the subject field. This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This footnote also confirms that this email message has been swept by the latest virus scan software available for the presence of computer viruses.

Sent to Yulun.

8/14/2001 4:12:05 PM yzhang Russell, they need to reconstruct the nh_job_schedule table. You need to ftp ~yzhang/script/nh_job_schedule.dat into ~ftp/outgoing and have him download it, then run the following. Be aware that nh_job_schedule.dat is a binary file, so the FTP should be done in bin mode.
echo "delete from nh_job_schedule; commit\g" | sql $NH_RDBMS_NAME
echo "copy nh_job_schedule() from 'nh_job_schedule.dat';commit\g" | sql $NH_RDBMS_NAME
Yulun

8/14/2001 4:46:17 PM yzhang Russell, here are the complete steps to reconstruct the nh_job_schedule table:
1) Run ~yzhang/scripts/create_job_schedule_17078.sh. Be aware this is an ASCII file, so use ASCII mode when FTPing it; tell the customer about this if he uses FTP for this file.
2) FTP ~yzhang/script/nh_job_schedule.dat into ~ftp/outgoing, and have him download it and place it in the directory where he will run the following command, then run:
echo "copy nh_job_schedule() from 'nh_job_schedule.dat';commit\g" | sql $NH_RDBMS_NAME
Be aware that nh_job_schedule.dat is a binary file, so the FTP should be done in bin mode.
Do a dbsave after the above succeeds. You might want to work with him over the phone. Thanks, Yulun

8/14/2001 5:55:47 PM rrick From: Rick, Russell Sent: Tuesday, August 14, 2001 5:41 PM To: 'larry@mirror-image.com' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Hi Larry,
Comments: Please follow the procedure below to reconstruct the nh_job_schedule table.
Instructions: Please perform the following:
1. Shut down the Nethealth Console by exiting the GUI.
2. Login as the nethealth user.
3. Execute the following command at the command line in the $NH_HOME directory: source nethealthrc.csh
4. Shut down the Nethealth Server by executing the following commands at the command line:
cd $NH_HOME
nhServer stop
5. Please download or FTP, in ASCII format, the attached "create_job_schedule_17078.sh" script to the $NH_HOME directory.
6. Execute the "create_job_schedule_17078.sh" script at the command line in the $NH_HOME directory.
7. Please go to the Concord Communications FTP site, ftp@concord.com/outgoing directory. Login: Userid = anonymous, Password = your email address
8.
Please download or FTP, in binary format, the "nh_job_schedule.dat" file into the $NH_HOME directory.
9. Please download or FTP, in ASCII format, the attached "copy_job_schedule_17078.sh" script to the $NH_HOME directory.
10. Execute the "copy_job_schedule_17078.sh" script at the command line in the $NH_HOME directory.
11. Execute the following command in the $NH_HOME/bin directory: nhSaveDb -p -u
12. Please forward the nhSaveDb log to support@concord.com, Attn: Russ Rick.
If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm EST. Regards, Russell K. Rick, Senior Support Engineer, Domestic and International Customers Group, Concord Communications, Inc. http://www.concord.com 600 Nickerson Road Marlboro, Ma. 01752 USA Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300 Technical Support hours are 5:00 AM to 8:00 PM EST, Monday through Friday. Please address all responses to support@concord.com and include your assigned call ticket number in the subject field.

8/15/2001 12:53:41 PM rrick From: Larry Finn [mailto:lfinn@mirror-image.com] Sent: Wednesday, August 15, 2001 9:46 AM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Russ- I got an error when I ran the script.
vibes.mirrorimage.net% sh create_job_schedule_17078.sh
create_job_schedule_17078.sh: syntax error at line 39: 'end of file' unexpected
vibes.mirrorimage.net%

From: Rick, Russell Sent: Wednesday, August 15, 2001 12:39 PM To: 'larry@mirror-image.com' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Larry, here is a new copy of create_job_schedule_17078.sh. Please replace the version you have with this one and re-run the procedure. Sorry about that! If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm EST. Regards, Russell K. Rick, Senior Support Engineer, Domestic and International Customers Group, Concord Communications, Inc. http://www.concord.com 600 Nickerson Road Marlboro, Ma. 01752 USA Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300

8/15/2001 3:16:33 PM rrick From: Larry Finn [mailto:lfinn@mirror-image.com] Sent: Wednesday, August 15, 2001 2:50 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Same error, different line.
vibes.mirrorimage.net% source nethealthrc.csh
vibes.mirrorimage.net% pwd
/opt/health
vibes.mirrorimage.net% nhServer stop
Stopping Network Health servers.
vibes.mirrorimage.net% sh create_job_schedule_17078.sh
create_job_schedule_17078.sh: syntax error at line 36: 'end of file' unexpected
vibes.mirrorimage.net%

From: Rick, Russell Sent: Wednesday, August 15, 2001 2:55 PM To: 'larry@mirror-image.com' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Please try this copy. I have gone through it line by line; this one should work. If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm EST. Regards, Russell K. Rick, Senior Support Engineer, Domestic and International Customers Group, Concord Communications, Inc. http://www.concord.com 600 Nickerson Road Marlboro, Ma. 01752 USA Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300 Technical Support hours are 5:00 AM to 8:00 PM EST, Monday through Friday. Please address all responses to support@concord.com and include your assigned call ticket number in the subject field.

8/15/2001 4:01:16 PM rrick From: Larry Finn [mailto:lfinn@mirror-image.com] Sent: Wednesday, August 15, 2001 3:40 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
Russ- The SaveDb script is cranking away. I'll send you the output when it finishes. LF

8/17/2001 3:30:52 PM yzhang Just have the customer place the database, as a tar file, in ~ftp/incoming; then you might want to load the db on the same system type and NH version on your machine. Then I will come in to continue.
Thanks, Yulun

8/17/2001 6:01:30 PM rrick From: Larry Finn [mailto:lfinn@mirror-image.com] Sent: Friday, August 17, 2001 3:46 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #52355 & Problem Ticket #17078
It's on the way. Mirror.tar LF
Yulun, the save.tdb is located on //BAFS/escalated tickets/52000/52355/mirror.tar

8/23/2001 7:57:13 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, August 23, 2001 7:43 AM To: Zhang, Yulun Subject: 17078 "Stats rollup failure" Yulun, do you have an update to this ticket? Thanks, Mike

8/28/2001 9:40:26 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Tuesday, August 28, 2001 9:26 AM To: Zhang, Yulun Subject: 17078 "Stats rollup failure" Yulun, do you have an update to this ticket? Thanks, Mike

8/28/2001 1:23:27 PM yzhang Load the customer's database on the same platform and version of nethealth, then run the db rollup. Let me know when you see the error message. Thanks, Yulun

9/4/2001 10:57:28 AM yzhang What is the system name, NH_HOME, and db name? I want to rlogin from my system. Thanks, Yulun

9/4/2001 11:28:18 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Tuesday, September 04, 2001 10:47 AM To: Zhang, Yulun Subject: RE: PT 17078 Database Ingres Rollups failing Yulun, below is the requested info. Thanks, Mike
System name=zinc
NH_HOME=/export/zinc1/nethealth
NH_RDBMS_NAME=nethealth

9/5/2001 4:58:25 PM yzhang Loaded the customer db in house, and am now trying to reproduce the problem

9/13/2001 2:45:26 PM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, September 13, 2001 2:30 PM To: Zhang, Yulun Subject: PT 17078 "Rollups failing" Yulun, have you been able to reproduce the error? Thanks, Mike

10/3/2001 1:04:00 PM yzhang Bob, this problem ticket was closed; you can close the corresponding call ticket. The customer's database got corrupted, and they did upgrade to 48; now they are up and running.
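The E_US1592 rollup failure in this ticket combines two recipes that recur throughout these logs: probe a stats table for duplicate (sample_time, element_id) rows, then, once clean, build the paired unique BTREE indexes. A sketch that just generates that SQL for a given table — the function names are mine; the statement text mirrors the commands support dictated for nh_stats0_996778799 and nh_stats0_996854399:

```shell
# dup_check_sql: the GROUP BY ... HAVING count(*) > 1 probe from the ticket;
# any row returned is a duplicate key that will break a unique index build.
dup_check_sql() {
    printf "select sample_time, element_id, count (*) from %s group by sample_time, element_id having count (*) > 1\\\\g\n" "$1"
}

# index_sql: the two unique BTREE indexes (one per key order) support creates
# on each stats table once it is duplicate-free.
index_sql() {
    t=$1
    printf "create unique index %s_ix1 on %s (sample_time, element_id) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100\\\\g\n" "$t" "$t"
    printf "create unique index %s_ix2 on %s (element_id, sample_time) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100\\\\g\n" "$t" "$t"
}

dup_check_sql nh_stats0_996778799
index_sql nh_stats0_996854399
# Each emitted statement would be run as: echo "<stmt>" | sql $NH_RDBMS_NAME
```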
Yulun

9/4/2001 11:00:33 PM rrick Initial problem:
NH_HOME/bin/sys/nhiIndexStats -u $NH_USER -d $NH_RDBMS_NAME
----- Begin processing (09/04/2001 12:20:03 PM).
Internal Error: Unable to connect to database 'nethealth' (E_LQ0001 Failed to connect to DBMS session. E_LC0001 GCA protocol service (GCA_REQUEST) failure. Internal service status E_GC0345 -- Server not accepting connections because of maintenance or shut down. Try another server, have the administrator re-open this one, or start a new one.. ). (du/DuDatabase::dbConnect)
----- Scheduled Job ended at '09/04/2001 12:20:03 PM'.
NOTE: This occurred on some reports, nhSaveDb, nhCollectCustData, etc.
Additional info: nhSaveDb and nhCollectCustData seem to be running, but are not using any CPU resources. nhSaveDb continues to process, but does nothing... it just hangs. It always hangs at the same stats2 table, nh_stats2_970376399, when looking in the *.tdb directory. Additional files in //bafs/escalated tickets/53000/53753

9/5/2001 5:17:19 PM rrick -----Original Message----- From: jmannie@qwest.com [mailto:jmannie@qwest.com] Sent: Wednesday, September 05, 2001 3:29 PM To: RRick@concord.com Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53753 & Problem Ticket #17637
Russell, it looks like the process has stopped; the file is not getting any more data. I thought I would send you what it has, and if the file gets more data I will let you know, but it stopped at the same file again.
Thanks, John Mannie, Qwest Communications, Total Care Network Management Systems, (612) 664-3856

Debug: Additional files in //bafs/escalated tickets/53000/53753

9/7/2001 11:03:26 AM yzhang Russell, have the customer try the following, then send me the two out files and iivdb.log:
1) Stop ingres and nhserver, then start ingres; check whether the four ingres processes are running: ps -ef | grep ing > ing_process.out
2) echo " help table nh_stats2_970981199\g" | sql $NH_RDBMS_NAME > table nh_stats2_970981199.out
3) verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_stats2_970981199

9/7/2001 11:49:48 AM rrick -----Original Message----- From: jmannie@qwest.com [mailto:jmannie@qwest.com] Sent: Friday, September 07, 2001 12:10 AM To: RRick@concord.com Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53753 & Problem Ticket #17637
Russell, I found this in /tmp/nhStartDb.log; not sure if it will help or not.
mds3mrep1% more nhStartDb.log
nhStartDb invoked on Thu Sep 6 15:13:23 CDT 2001
Starting OpenIngres servers on Thu Sep 6 15:13:23 CDT 2001 ...started successfully.
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc.
Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Thu Sep 6 16:14:14 2001
continue
* Ingres Version OI 2.0/9712 (su4.us5/00) logout Thu Sep 6 16:14:14 2001
Sysmoding database 'nethealth' . . .
Modifying 'iiattribute' . . .
E_US1208 Duplicate records were found. (Thu Sep 6 16:14:20 2001)
Sysmod of database 'nethealth' abnormally terminated.
Exitting nhStartDB with status 17
Thanks, John Mannie

-----Original Message----- From: jmannie@qwest.com [mailto:jmannie@qwest.com] Sent: Friday, September 07, 2001 12:19 AM To: RRick@concord.com Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53753 & Problem Ticket #17637
Russell, I also found this file from Aug 31st in /opt: "-rw-rw-rw- 1 root other 450276 Aug 31 12:58 restoresymtable". This is the day and time we swapped out the hard drive. -John Mannie

-----Original Message----- From: Zhang, Yulun Sent: Friday, September 07, 2001 10:52 AM To: Rick, Russell Subject: prob. 17637
Russell, have the customer try the following, then send me the two out files and iivdb.log:
1) Stop ingres and nhserver, then start ingres; check whether the four ingres processes are running: ps -ef | grep ing > ing_process.out
2) echo " help table nh_stats2_970981199\g" | sql $NH_RDBMS_NAME > table nh_stats2_970981199.out
3) verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_stats2_970981199

-----Original Message----- From: Rick, Russell Sent: Friday, September 07, 2001 11:31 AM To: 'jmannie@qwest.com' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53753 & Problem Ticket #17637
Good Morning John, please execute the following:
1) Stop ingres
2) Stop nhserver
3) Start ingres
4) Check if the four ingres processes are running by executing the following command: ps -ef | grep ing > ing_process.out
5) Execute the following in the $NH_HOME directory: echo "help table nh_stats2_970981199\g" | sql $NH_RDBMS_NAME > table nh_stats2_970981199.out
6) Execute the following command from $NH_HOME/idb/ingres/bin/: verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_stats2_970981199
7) Please forward the output from 4) and 5)
Thanks again for your patience, - Russ

9/7/2001 6:03:48 PM yzhang Created an issue with CA with the following message: We have a customer whose dbsave failed due to a dead table. The table name is nh_stats2_970981199.
I have asked the customer to do the following:
help table nh_stats2_970981199\g
select count (*) from nh_stats2_970981199\g
copydb -c nethealth nh_stats2_970981199
verifydb -mreport -sdbname nethealth -otable nh_stats2_970981199
but all of them hang; it looks like there is no way they can access the table. We know the table is there. The customer doesn't want to drop it (I don't even know if it can be dropped) because the table contains valuable information. What can I do to recover the table? Thanks, Yulun. Here is the part of errlog.log

9/10/2001 12:37:40 PM yzhang Can you both work with the customer, following the example instruction below, to remove the deadlock? Do a practice run before instructing the customer; let me know if you have problems. Thanks, Yulun

9/10/2001 5:19:04 PM yzhang Sheldon, I think Jason is out for today. I just wonder if you can get the following from the customer, from system vusno251.epi.tcxf.in.telstra.com.au:
echo "copy table nh_stats0_998989199() into 'nh_stats0_998989199.dat'\g" | sql $NH_RDBMS_NAME
Send the nh_stats0_998989199.dat; be aware that this is a binary file, so it needs bin mode when transferring with FTP. Thanks, Yulun

9/10/2001 5:47:00 PM jpoblete Yulun, the DB save worked fine; the customer was able to save the DB without errors. He asked to close this one.

9/11/2001 9:42:42 AM yzhang dbsave is OK now

9/6/2001 1:08:23 PM foconnor nhSaveDb fails with segmentation fault errors. Customer has not been able to save the database for several months. The save.log file looks like there are no errors, but it does not appear to finish unloading tables. The transaction log is 700 MB. From the errlog.log (see also //BAFS/escalated tickets/49000/49877/Db/51224_IngresDB):
00000124 Wed Aug 29 14:41:56 2001 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location.
00000124 Wed Aug 29 14:41:56 2001 E_DM93A7_BAD_FILE_PAGE_ADDR Page 364 in table nh_run_step, owner: nethealth, database: nethealth, has an incorrect page number: 0.
Other page fields: page_stat 00000000, page_log_address (00000000,00000000), page_tran_id (0000000000000000). Corrupted page cannot be read into the server cache. 00000124 Wed Aug 29 14:41:56 2001 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. 00000124 Wed Aug 29 14:41:56 2001 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. 00000124 Wed Aug 29 14:41:56 2001 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. 00000124 Wed Aug 29 14:41:56 2001 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location. 00000124 Wed Aug 29 14:41:56 2001 E_DM904C_ERROR_GETTING_RECORD Error getting a record from database:nethealth, owner:nethealth, table:nh_run_step. 00000124 Wed Aug 29 14:41:56 2001 E_DM008A_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) 00000124 Wed Aug 29 14:41:56 2001 E_QE007C_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) MANAGER ::[II\INGRES\da , 000000aa]: Wed Aug 29 14:42:40 2001 E_SC0271_EVENT_THREAD The SCF alert subsystem event thread has been altered. The operation code is 0 (0 = REMOVE, 1 = ADD, 2 = MODIFY). MANAGER ::[II\INGRES\da , ffffffff]: Wed Aug 29 14:42:41 2001 E_SC0235_AVERAGE_ROWS On 549. select/retrieve statements, the average row count returned was 7. MANAGER ::[II\INGRES\da , ffffffff]: Wed Aug 29 14:42:41 2001 E_SC0128_SERVER_DOWN Server Normal Shutdown. MANAGER ::[II\INGRES\da , ffffffff]: Wed Aug 29 14:42:41 2001 E_CL2518_CS_NORMAL_SHUTDOWN The Server has terminated normally. ::[II_ACP , 000000d3]: Wed Aug 29 14:42:51 2001 E_DM9815_ARCH_SHUTDOWN Archiver was told to shut down. MANAGER ::[ , 00000000]: Wed Aug 29 14:42:52 2001 E_GC0152_GCN_SHUTDOWN Name Server normal shutdown. 
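The dead-table diagnostics Yulun and Russ laid out above (check the four ingres processes, dump the table's catalog entry with `help table` through sql, then run verifydb in report mode) can be collected into one script. The sketch below is not from the ticket: it is a dry run that only prints each command (the `run` helper is illustrative) so the sequence can be reviewed before touching a live Ingres installation; the default table name is the one from this ticket.

```shell
#!/bin/sh
# Dry-run sketch of the dead-table diagnostics: prints each command
# instead of executing it, so nothing touches a live installation.
TABLE="${1:-nh_stats2_970981199}"      # table from this ticket; override as arg 1
RDBMS="${NH_RDBMS_NAME:-nethealth}"    # Nethealth database name

run() { printf 'WOULD RUN: %s\n' "$*"; }

# 1) confirm the four ingres processes are up after a stop/start cycle
run "ps -ef | grep ing > ing_process.out"
# 2) dump the table's catalog entry; a hang here points at a dead table
run "echo \"help table ${TABLE}\\g\" | sql ${RDBMS} > ${TABLE}.out"
# 3) report-mode verifydb inspects the table without modifying it
run "verifydb -mreport -sdbname ${RDBMS} -otable ${TABLE}"
```

Piping the printed commands through a shell would execute them; in the ticket they were run one at a time with the output files sent back to support.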
9/6/2001 1:42:07 PM yzhang Farrell, Run a resetRunStep.sh from escalated/scripts, and follow the direction for running the script, then do a db save 9/6/2001 2:31:52 PM foconnor Sent script 9/18/2001 6:00:00 PM yzhang Farrell, Any update on this one 9/21/2001 5:42:49 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Friday, September 21, 2001 5:26 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: FW: Call ticket 51224 Yulun, He ran the resetRunStep and only got the run_step.dat file 10/9/2001 10:22:06 AM yzhang Bob, Can you check the current status with the customer regarding this. Yulun 10/16/2001 3:36:04 PM yzhang Bob, Check if they still have the same problem, and run a nhCollectCustData. Thanks Yulun 10/26/2001 4:35:44 PM yzhang If we do not hear back from you by the end of business 10-22-2001, we will close your call ticket. At that point, you can choose to re-open your incident and get a new call ticket number assigned to you by calling Technical Support at 1-888-832-4340. 10/26/2001 4:37:46 PM yzhang Bob, this problem ticket was closed because we have not heard anything from the customer in more than two weeks. Please close the associated call ticket Yulun 10/29/2001 4:21:55 PM yzhang Bob, you can try to copy these two tables (nh_run_step and nh_system_log) into files (keep the files), then drop, create, index, and reload the data, because the page number and location for those tables are corrupted. If this does not solve the problem, you might want to create an issue with CA. Thanks Yulun 10/30/2001 11:51:34 AM mfintonis Hi Yulun, Sorry to do this to you, but the reseller replied; they are still having the problem with archiving the database with a db save. The errlog.log and save.log are on the escalated tickets dir with the name 10-29-2001. 
-Bob -------------------------------------------------------------------------------------------------- Yulun's reply to above email from Bob: Monday, October 29, 2001 4:21:55 PM yzhang Bob, you can try to copy these two tables (nh_run_step and nh_system_log) into files (keep the files), then drop, create, index, and reload the data, because the page number and location for those tables are corrupted. If this does not solve the problem, you might want to create an issue with CA. Thanks Yulun 11/1/2001 10:37:08 PM rkeville Wrote script and sent it to customer. 11/8/2001 12:24:48 PM yzhang problem solved, and ticket closed 11/8/2001 12:26:12 PM yzhang problem solved, and ticket closed 9/7/2001 11:11:29 AM cpaschal Database maintenance (nhReset -db) fails Siemens Business Services is running NH 4.8 p2 d2 on a Solaris 2.8 server, which is having the Maintenance job hang. This currently forces the customer to stop and re-start NH whenever the job runs. They changed when maintenance was scheduled from Sunday to Monday. This didn't help the hanging, but allowed them to be able to react to the problem quicker. They disabled maintenance so that it wouldn't run. They decided to be proactive in the server cycling and put it in a cron job set for Monday morning. (The cron job is doing nothing for them at this point since maintenance has been disabled.) Amazingly enough, the only problem with the System appears to be the failing of maintenance. Everything else on the system is working fine according to Herman. Errlog.log shows the following error and stack dumps: COL-ERL1::[42916 , 00000024]: Wed Aug 15 01:45:31 2001 E_DM9C75_DM2D_CLOSE_TCBBUSY Table Descriptor unexpectedly busy at database close time. The TCB for table (iirelation, $ingres) of database nethealth was found referenced while the database was being closed: tcb_ref_count 2, tcb_valid_count 2, tcb_status 0x00000000. 
=== Maintenance Log error: Network Health processes are not running nhServer requires an existing, accessible database. However, INGRES returned the following error when an attempt was made to access 'nethealth': E_LQ0009 Communications or transmission error received without text. The generic error number is 39100, the local error is 590342, and the number of parameters is 0 (should be 1). Please specify an existing database: Make sure that the database storage is mounted and accessible. === Running nhReset -db manually shows the following errors: Error starting OpenIngres Servers: INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. E_LC0001 GCA protocol service (GCA_REQUEST) failure. Internal service status E_GC0025 -- Unable to connect to Name Server: Name Server ID incorrect. E_LQ0001 Failed to connect to DBMS session. ========================================================= Please fix the problem and try running nhStartDb again. + sleep 5 + /nethealth/bin/nhServer start nhServer requires INGRES to be running and accessible, but it doesn't seem to be. Please correct the following error returned by INGRES: E_LC0001 GCA protocol service (GCA_REQUEST) failure. Internal service status E_GC0025 -- Unable to connect to Name Server: Name Server ID incorrect. E_LQ0001 Failed to connect to DBMS session. All files can be found on \\bafs\escalated tickets\51000\51957 CollectCustData output is under Aug28th 10/31/2001 7:18:59 AM cestep This problem was resolved after loading patch 6. 19/7/2001 5:48:48 PM rrick - Error in errlog.log: nethealt::[32809 IIGCN, 00000000]: Fri Aug 31 14:13:38 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. nethealt::[32809 IIGCN, 00000000]: Wed Sep 5 11:11:31 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. 
nethealt::[32809 IIGCN, 00000000]: Wed Sep 5 11:18:18 2001 E_GC0139_GCN_NO_DBMS No DB ::[II_RCP , 00000001]: Thu Sep 6 14:25:45 2001 E_DM9006_BAD_FILE_WRITE Disk file write error on database:nethealth table:iiattribute pathname:/opt/nethealth/idb/ingres/data/default/nethealth filename:aaaaaaad.t00 page:2989 write() failed with operating system error 0 (Error 0) - Cannot change transaction log size. - Enough disk space - Cannot bring up ingres.....recovery server fails. - nhCollectCustData fails....no dbms servers All files in BAFS/escalated tickets/53000/53869 9/7/2001 6:00:50 PM schapman Escalated at Yulun's request 9/10/2001 12:14:38 PM yzhang Russell, I think the customer needs to reconfigure their kernel based on the following instructions Ingstart or nhStartDb fails with the following error messages. Checking host "rrdnms1" for system resources required to run Ingres... 11173888 byte shared memory segment required by LG/LK sub-systems. 278528 byte shared memory segment required by DBMS server(s). 0 bytes is the maximum shared memory segment size. 3 shared memory segments required. 0 is the total number of shared memory segments allocated by the system. 0 shared memory segments are currently available. Your system does not have sufficient resources to run Ingres as configured. 9/10/2001 1:05:33 PM rrick -----Original Message----- From: Zhang, Yulun Sent: Monday, September 10, 2001 12:02 PM To: Rick, Russell Subject: prob. 17743 Russell, I think the customer needs to reconfigure their kernel based on the following instructions Ingstart or nhStartDb fails with the following error messages. Checking host "rrdnms1" for system resources required to run Ingres... 11173888 byte shared memory segment required by LG/LK sub-systems. 278528 byte shared memory segment required by DBMS server(s). 0 bytes is the maximum shared memory segment size. 3 shared memory segments required. 0 is the total number of shared memory segments allocated by the system. 
0 shared memory segments are currently available. Your system does not have sufficient resources to run Ingres as configured. rrdnms1% Have the customer add the following kernel parameters to the /etc/system file to get the shared mem to initialize correctly. - forceload: sys/semsys - forceload: sys/shmsys -----Original Message----- From: Rick, Russell Sent: Monday, September 10, 2001 12:47 PM To: 'or vbrown@csc.com' Subject: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Hi Vickie, Notes: I am the Senior Support Engineer assigned to your Escalated Call Ticket. Issue: - Error in errlog.log: nethealt::[32809 IIGCN, 00000000]: Fri Aug 31 14:13:38 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. nethealt::[32809 IIGCN, 00000000]: Wed Sep 5 11:11:31 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. nethealt::[32809 IIGCN, 00000000]: Wed Sep 5 11:18:18 2001 E_GC0139_GCN_NO_DBMS No DB ::[II_RCP , 00000001]: Thu Sep 6 14:25:45 2001 E_DM9006_BAD_FILE_WRITE Disk file write error on database:nethealth table:iiattribute pathname:/opt/nethealth/idb/ingres/data/default/nethealth filename:aaaaaaad.t00 page:2989 write() failed with operating system error 0 (Error 0) Comments: Please speak with your System Administrator about this box. In order to bring Ingres up you will need to reconfigure the kernel based on the following instructions: Ingstart or nhStartDb fails with the following error messages. Checking host "rrdnms1" for system resources required to run Ingres... 11173888 byte shared memory segment required by LG/LK sub-systems. 278528 byte shared memory segment required by DBMS server(s). 0 bytes is the maximum shared memory segment size. 3 shared memory segments required. 0 is the total number of shared memory segments allocated by the system. 0 shared memory segments are currently available. 
--------> Your system does not have sufficient resources to run Ingres as configured. Your shared memory is too low to allow ingres to boot and then function. If you have any problems or issues, please feel free to contact support@concord.com , Attn: Russ Rick. My support hours are 11:30am - 8:00pm, est. Regards, Russell K. Rick, Senior Support Engineer Domestic and International Customers Group Concord Communications, Inc. http://www.concord.com 600 Nickerson Road Marlboro, Ma. 01752 USA Toll Free: 888-832-4340 or 508-460-4646 Fax: 508-303-4343 Intl: 508-303-4300 Technical Support hours are 5:00 AM to 8:00 PM EST, Monday through Friday. Please address all responses to support@concord.com and include your assigned call ticket number in the subject field. -----Original Message----- From: Rick, Russell Sent: Monday, September 10, 2001 12:49 PM To: 'or vbrown@csc.com' Subject: FW: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Hi Vickie, I forgot to add the most important thing: Add the following kernel parameters to the /etc/system file to get the shared mem to initialize correctly. - forceload: sys/semsys - forceload: sys/shmsys 9/10/2001 1:38:53 PM yzhang Russell, remind the customer that they need to back up the current system file, and reboot the system whenever the system file is modified. Try ingstart after the reboot. Thanks Yulun 9/10/2001 3:21:02 PM rrick -----Original Message----- From: Brown, Mary V [mailto:Mary.V.Brown@usdoj.gov] Sent: Monday, September 10, 2001 2:58 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Russ, I don't understand how adding 66 elements to 250 elements could cause us to need to remake the kernel. Things were OK until I added the 66 elements... 
Vickie -----Original Message----- From: Rick, Russell Sent: Monday, September 10, 2001 3:05 PM To: 'Brown, Mary V'; Rick, Russell Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Hi Vickie, What may have happened is that adding those records threw you over the shared memory threshold the system was set for. Did changing the shared memory help you bring up Ingres? Thanks again, - Russ 9/10/2001 3:55:20 PM rrick -----Original Message----- From: Brown, Mary V [mailto:Mary.V.Brown@usdoj.gov] Sent: Monday, September 10, 2001 3:25 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Russ, I haven't gotten the guy with the power yet, so I don't know if your solution will work. Here's what /etc/system looks like now. I've included only the uncommented out entries: * set: * set maxusers=40 * * shared memory: * set shmsys:shminfo_shmmax=15418496 set shmsys:shminfo_shmmin=200 set shmsys:shminfo_shmmni=200 set shmsys:shminfo_shmseg=200 Vickie 9/10/2001 6:02:11 PM rrick -----Original Message----- From: Brown, Mary V [mailto:Mary.V.Brown@usdoj.gov] Sent: Monday, September 10, 2001 5:44 PM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Russ, No luck. Made the change to /etc/system, rebooted box. Still get errors. vickie 9/11/2001 10:06:04 AM yzhang have customer send me the output of ipcs. 
Basically I want to check that there should be three shared memory segments from ipcs Thanks Yulun 9/12/2001 11:17:06 AM yzhang Russell, Have customer try the following: 1) manually remove and recreate ingres_log through the following steps: - login as ingres - cd to ingres/log - rm ingres_log - touch ingres_log - nhResizeIngresLog size (place whatever size they have now) - ingstart (send the output of ingstart if it fails) 2) if the ingstart fails, instruct the customer to start each of the ingres processes separately, and send the output of each start eg. ingstart -iigcn Thanks Yulun 9/12/2001 12:09:43 PM rrick -----Original Message----- From: Brown, Mary V [mailto:Mary.V.Brown@usdoj.gov] Sent: Wednesday, September 12, 2001 8:39 AM To: 'Rick, Russell' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Russ, Here's what it looks like this morning. Vickie Russell, Have customer try the following: 1) manually remove and recreate ingres_log through the following steps: - login as ingres - cd to ingres/log - rm ingres_log - touch ingres_log - nhResizeIngresLog size (place whatever size they have now) - ingstart (send the output of ingstart if it fails) 2) if the ingstart fails, instruct the customer to start each of the ingres processes separately, and send the output of each start eg. 
ingstart -iigcn Thanks Yulun -----Original Message----- From: Rick, Russell Sent: Wednesday, September 12, 2001 11:54 AM To: 'Brown, Mary V' Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #53869 & Problem Ticket #17743 Hi Vickie, Please perform the following: 1) Manually remove and recreate ingres_log through the following steps: - login as ingres user - cd to $NH_HOME/idb/ingres/log - rm ingres_log - touch ingres_log - nhResizeIngresLog size (place whatever size they have now) - ingstart (send the output of ingstart if it fails) 2) If the ingstart fails, please start each of the ingres processes separately, and send in the output of each start: eg. a. ingstart -iigcn b. ingstart -iidbms c. ingstart -iidbms recovery d. ingstart -dmfacp Thanks Russ Rick 9/12/2001 1:46:04 PM yzhang Can you send me errlog.log under idb/ingres/files, and the following 1) as nhuser after source nethealthrc.csh nhForceDb iidbdb > force.out sql iidbdb > sql.out 2) as ingres, just use the ingres window you have: rollforwarddb iidbdb infodb iidbdb Thanks Yulun 9/12/2001 2:03:37 PM yzhang Don, This customer is up and running; the ticket can be de-escalated. Their problem was that iidbdb and nethealth both became inconsistent, and they are up and running now after rollforwarddb on iidbdb and nhForceDb on nethealth. Yulun 9/12/2001 4:23:07 PM yzhang Vickie, Can you collect the following: login as nhuser and source nethealthrc.csh 1) echo "select * from iifile_info where table_name = 'nh_stats0_999284399'\g" | sql $NH_RDBMS_NAME > nh_stats0_999284399.out 2) use above window. verifydb -mreport -sdbname nethealth -otable nh_stats0_999284399 and send the iivdb.log from ingres/files If you have problems doing these, call Russell. Yulun 9/13/2001 2:05:20 PM yzhang As nhuser, after sourcing nethealthrc.csh, run the following command. 
Russell, help her to make sure this table is dropped, then run nhSaveDb verifydb -mrun -sdbname "nethealth" -odrop_table "nh_stats0_999284399" 9/14/2001 2:14:50 PM yzhang Run the attached unload.sh; if the unload.sh succeeded (if there is no error message coming out), run load.sh, from a window logged in as nhuser with nethealthrc.csh sourced. Both scripts need to be placed in $NH_HOME. Thanks Yulun 9/14/2001 3:18:48 PM yzhang Don't remove or delete anything from your $NH_HOME and its subdirectories. Now from the nhuser window do: echo "delete from nh_rlp_boundary where max_range = 999284399 and rlp_stage_nmbr = 0 and rlp_type = 'ST'\g" | sql nethealth If this succeeds, then you can run nhSaveDb Thanks Yulun 9/18/2001 2:32:17 PM yzhang customer is up and running 9/12/2001 7:21:17 AM shagar Error: "nhiDialogRollup.exe Exception: stack overflow (0xc00000fd), Address: 0x7800160e" This situation was identical to what happened in ticket #52956 Same customer as well. The end result was that they were sent a new nhiDialogRollup script (which can be found in BAGS\52000\52956) The customer is still using this .exe file Got the nhCollectCustData output from this customer to review logs The Conversations_Rollups appear to be completing. Statistics Rollups completing Data_Analysis is showing Warnings that some jobs are incorrectly defined (MyHealth and Service Level jobs) Database_Save.log shows save was successful on 9/11/01 The customer only has 2% (300MB) more space on his disk. The database continues to grow. They now want only to save the Statistical data and remove the conversations data to make more space. Sending the cleanNodes.sh and cleanDlg.sh scripts. However, this is the second time this has occurred with their system. 9/12/2001 10:01:02 AM yzhang I read the detailed description for this problem. I noticed that the customer may have run the wrong script for removing the TA data. 
The correct script to remove the TA data is nhDropDlg.sh, located in the escalated directory along with the instructions. Find out what script they have run, and what the result was. The other thing I want to know is who provided the new nhiDialogRollup.exe, and where nhiDialogRollup.exe came from. Thanks Yulun 9/18/2001 6:03:16 PM yzhang Mike, I requested the following information a few days ago; I don't know if you have any update on this --------------------------------------------- I read the detailed description for this problem. I noticed that the customer may have run the wrong script for removing the TA data. The correct script to remove the TA data is nhDropDlg.sh, located in the escalated directory along with the instructions. Find out what script they have run, and what the result was. The other thing I want to know is who provided the new nhiDialogRollup.exe, and where nhiDialogRollup.exe came from. 9/19/2001 10:19:41 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Wednesday, September 19, 2001 10:03 AM To: Zhang, Yulun Subject: RE: 17799/53989 Hi Yulun, Sorry, I was out yesterday. I sent the customer the nhDropDlg.sh script and they plan on working this issue later this week. I don't know where they got the new nhiDialogRollup.exe, I will further research that and let you know. All they want to do now is delete all conversation data. They are getting rid of TA. Thanks, Mike 10/2/2001 4:16:41 PM don Alain called in and said this ticket can be closed.... 9/12/2001 8:17:43 PM rrick Problem: Nethealth fails to start up. No nhSaveDb. Ingres fails to start up. Errlog.log excerpt: FCCARE ::[II\INGRES\106 , ffffffff]: Tue Aug 28 10:56:11 2001 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (int.wnt/00) Server -- Normal Startup. 00000110 Wed Sep 05 13:08:08 2001 E_CL0606_DI_BADWRITE Error writing page to disk write() failed with operating system error 33 (The process cannot access the file because another process has locked a portion of the file.) 
00000110 Wed Sep 05 13:08:08 2001 E_DMA44F_LG_WB_BLOCK_INFO An I/O error was encountered writing to the PRIMARY log file. At page 17660, an error was encountered writing 1 pages from buffer address 40F97A00. The current log file page size is 4096, and the buffer address is 044FC000. 00000110 Wed Sep 05 13:08:08 2001 E_CL0606_DI_BADWRITE Error writing page to disk write() failed with operating system error 33 (The process cannot access the file because another process has locked a portion of the file.) 00000110 Wed Sep 05 13:08:08 2001 E_DMA44E_LG_WBLOCK_BAD_WRITE An internal error was encountered writing a log file page. Buffer 40F97788 (001E7A00) was being written to the PRIMARY log by logwriter thread 7,1 when an error was encountered. write() failed with operating system error 33 (The process cannot access the file because another process has locked a portion of the file.) 00000110 Wed Sep 05 13:08:08 2001 E_CL0606_DI_BADWRITE Error writing page to disk write() failed with operating system error 33 (The process cannot access the file because another process has locked a portion of the file.) 00000110 Wed Sep 05 13:08:08 2001 E_CL0606_DI_BADWRITE Error writing page to disk write() failed with operating system error 33 (The process cannot access the file because another process has locked a portion of the file.) FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_DM014D_LOGWRITER An error occurred in a session used to write logfile pages to the transaction log file. The LogWriter session will be terminated. FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_SC0320_LOGWRITER_EXIT Logwriter Thread terminated abnormally. FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_SC0241_VITAL_TASK_FAILURE A Server Task thread necessary to the server has terminated forcing the shutdown of the DBMS server - Server Task name ' '. 
FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_PS0501_SESSION_OPEN There were open sessions when trying to shut down the parser facility. FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_DM005B_SESSION_OPEN Session(s) are open. FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_SC0235_AVERAGE_ROWS On 610. select/retrieve statements, the average row count returned was 9. FCCARE ::[II\INGRES\106 , 00000110]: Wed Sep 05 13:08:08 2001 E_SC0127_SERVER_TERMINATE Error terminating Server. ::[II_ACP , 0000010c]: Wed Sep 05 13:08:08 2001 E_DM9815_ARCH_SHUTDOWN Archiver was told to shut down. 0000010c Wed Sep 05 13:08:08 2001 E_CL0F10_LG_WRITEERROR The %s transaction logfile has encountered an I/O error. An attempt to write to the transaction logfile was rejected with an error. ULE_FORMAT: Couldn't look up system error (reason: ER error 10902) E_CL0902_ER_NOT_FOUND No text found for message identifier 0000010c Wed Sep 05 13:08:08 2001 E_DMA44E_LG_WBLOCK_BAD_WRITE An internal error was encountered writing a log file page. Buffer 40F963C0 (001E6600) was being written to the PRIMARY log by logwriter thread 12,1 when an error was encountered. Results from procedure below: - log subdirectory under oping not ingres. - nhResizeIngres failed with unable to shutdown all ingres processes. - iigcn started ok. - iidbms failed with unable to start iidbms. - rrick will file a bug and talk with Yulun in the morning. Procedure: 1) Manually remove and recreate ingres_log through the following steps: - login as ingres user - cd to $NH_HOME/idb/ingres/log - rm ingres_log - touch ingres_log - nhResizeIngresLog size (place whatever size they have now) - ingstart (send the output of ingstart if it fails) 2) If the ingstart fails, please start each of the ingres processes separately, and send in the output of each start: eg. a. ingstart -iigcn b. ingstart -iidbms c. ingstart -iidbms recovery d. ingstart -dmfacp Files in BAFS/escalated tickets/53000/53981. 
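The transaction-log rebuild procedure quoted above recurs in several of these tickets, so it can help to see the steps in order. The sketch below is an assumption-laden dry run, not the ticket's script: it only prints each step (`step` is an illustrative helper), the 510 MB default comes from the log size mentioned later in this ticket, and on NT the equivalent stop/start goes through the Services control panel rather than ingstart.

```shell
#!/bin/sh
# Dry-run sketch of the ingres transaction-log rebuild procedure:
# prints the steps in order instead of executing them.
LOG_SIZE="${1:-510}"   # current transaction log size in MB (site-specific)

step() { printf 'STEP: %s\n' "$*"; }

step "stop nhServer first, then log in as the ingres user"
step "cd \$NH_HOME/idb/ingres/log"
step "rm ingres_log && touch ingres_log"
step "nhResizeIngresLog ${LOG_SIZE}"
step "ingstart"
# if ingstart fails, start each process separately to isolate the failure:
step "ingstart -iigcn"
step "ingstart -iidbms"
step "ingstart -iidbms recovery"
step "ingstart -dmfacp"
```

Walking the steps this way makes the ordering constraint visible: nhServer must be down before the log file is removed, and the individual ingstart flags only matter once the combined start has failed.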
9/13/2001 4:13:10 PM schapman After review of information by Yulun, requested escalation. 9/14/2001 10:44:11 AM yzhang Russell, by looking at the errlog.log, I think you have given the right instructions to the customer. The only suggestion is that they need to stop nhServer when removing the transaction log. About starting the ingres processes on NT individually, you might want to test on your NT and see how it works, because on NT we use a service to start ingres. Also ask the customer to check their disk and hardware system. Thanks Yulun 9/14/2001 2:52:27 PM yzhang At this point, they have a transaction log size of 510 MB. Is this the size they want? What was the original size before the resize? If the transaction log size is ok, then the easy way is to go to Control Panel/Services and stop then start ingres from there, because they are on an NT system. Thanks Yulun 9/16/2001 10:37:25 AM yzhang Russell, The tar file you attached cannot be uncompressed; have the customer re-run this, and make sure to use bin mode when transferring the tar file with ftp. Thanks Yulun 9/17/2001 11:18:56 AM yzhang This is the console message from the customer; it is an escalated ticket (17816). The original problems were being unable to resize the ingres log and unable to start ingres. Both problems have been solved. But now when they start the console, everything comes up and runs for 30 seconds, then it crashes immediately with the error: Friday, 9/14/2001 02:43:58 PM Error (Statistics Poller) Read of 'dataSourceInfo.ddi' failed. Friday, 9/14/2001 02:44:00 PM The server stopped unexpectedly, restarting . . . Do you know where dataSourceInfo.ddi is, how it is created, and so on? Thanks Yulun 9/17/2001 2:12:14 PM yzhang Russell, have them run nhiGenDataSourceInfo (maybe in bin/sys; you can do a search or find) to create dataSourceInfo.ddi, then start nethealth. Also get the install log from the customer. Here is their problem now: Friday, 9/14/2001 02:43:58 PM Error (Statistics Poller) Read of 'dataSourceInfo.ddi' failed. 
Friday, 9/14/2001 02:44:00 PM The server stopped unexpectedly, restarting . . . 9/18/2001 11:36:45 AM yzhang I want to see if your nethealth installation succeeded. Can you send me the following: 1) install.log in /$NH_HOME/log/install 2) $NH_HOME/poller/dataSourceInfo.ddi I would appreciate it if you can send these as soon as possible. Thanks Yulun 9/18/2001 2:29:52 PM yzhang Don, This customer is up and running, and it can be de-escalated 11/8/2001 12:24:15 PM yzhang problem solved and ticket is closed 9/13/2001 10:43:05 AM foconnor Database is corrupted with this error: E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Mon Sep 10 05:08:27 2001) The nhCollectCustData script does not run. "This information I got in the ingres file errlog.log after we restarted the system with power off - on." iivdb.log has ~40 pages of statements similar to below (See //BAFS/53000/53630/Sept10: ************************************************* verifydb 11-Sep-2001 10:27:08 ************************************************* S_DU04C4_DROPPING_TABLE VERIFYDB: beginning the drop of table nh_stats0_995241599 from database nethealth. E_DU5025_NO_SUCH_TABLE Table nh_stats0_995241599 (owner ingres) does not exist. E_DU5024_TBL_DROP_ERR Unable to destroy table nh_stats0_995241599 from database nethealth. From the errlog.log: E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error CONCORD ::[32802 , 00000013]: Tue Sep 11 07:49:44 2001 E_PS0904_BAD_RDF_GETDESC RDF error occurred when getting description for an object. CONCORD ::[32802 , 00000013]: Tue Sep 11 07:49:44 2001 E_PS0007_INT_OTHER_FAC_ERR PSF detected an internal error when calling other facility. CONCORD ::[32802 , 00000013]: Tue Sep 11 07:49:44 2001 E_SC0215_PSF_ERROR Error returned by PSF. 
Concord% 9/13/2001 11:06:22 AM yzhang The ingres system catalog has been corrupted; the customer needs to destroy, create, and reload the database 9/17/2001 1:04:06 PM foconnor Customer reinstalled ingres - closed 9/13/2001 2:13:20 PM mwickham The following error is generated in the Data Analysis log file each night after upgrading to Network Health 4.8: Error: Unable to execute 'MODIFY nh_elem_outage TO MERGE' (E_US1595 MODIFY: nh_elem_outage: table is not a btree; only a btree table can be modified to merge. (Thu Sep 6 13:16:18 2001) Database consists of 464 elements and is 513MB in size. There are no probes. 9/13/2001 4:04:57 PM mwickham -----Original Message----- From: Trei, Robin Sent: Thursday, September 13, 2001 03:44 PM To: Wickham, Mark Subject: FW: AR System Notification It looks like they may have run out of space when doing the upgrade, and a table did not get indexed. Could you get the following information for me? echo "select * from nh_rpt_config\g" | sql $NH_RDBMS_NAME > config_ids.out echo "help table *\g" | sql $NH_RDBMS_NAME > table_info.out 9/14/2001 2:37:42 PM mwickham Customer has run the requested SQL statements and provided the resulting output files, config_ids.out and table_info.out. They can be found on BAFS in \escalated tickets\52000\52964\14Sep01. 9/19/2001 4:03:42 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Wednesday, September 19, 2001 03:48 PM To: Trei, Robin Subject: Problem Ticket 17828 (52964) Robin, I updated the subject problem ticket with the requested information. Have you had a chance to review the script output files? Thanks - Mark 9/24/2001 3:29:43 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Monday, September 24, 2001 03:14 PM To: Gray, Don; Bailey, Tom Subject: Call Ticket 52964 / Problem Ticket 17828 Please escalate the subject problem ticket. Data analysis has been failing since the customer's upgrade to 4.8, and they are getting very anxious for a resolution. 
Thanks - Mark 10/18/2001 6:40:48 AM mwickham Please close this problem ticket. The customer is no longer experiencing this problem, as we are performing the upgrade for them. Thanks. 10/19/2001 8:57:30 AM foconnor Closed as per Mark W. 9/19/2001 8:17:32 AM cestep Database saves and Data Analysis are failing. Data Analysis fails with: Warning: Unexpected database error. Error: Unexpected database error. Database save fails with: Unload the dac tables. . . Unloading table nh_daily_exceptions_1000001 .... Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions_1000001 () INTO '/health/db/save/Sept9-18.tdb/nh_daily_exceptions_1000001'' (E_US0845 Table 'nh_daily_exceptions_1000001' does not exist or is not owned by you. (Tue Sep 18 11:03:40 2001) ). (cdb/DuTable::saveTable) The table nh_daily_exceptions_1000001 does not exist. Ingres went down and nethealth became inconsistent. Ran nhForceDb and made nethealth consistent. Since we cannot save successfully, this database is still up. We ran listDacs and got: Listing all service profiles known in database name, config_id PP_Plant_Traffic_lw, 1000001 Golden_Cat_WAN_Traffic_lw, 1000002 LAN-TRAFFIC_lw, 1000003 Frame_Relay_Report_lw, 1000004 Unix_Backups_lw, 1000005 Plant_LAN_Traffic_lw, 1000006 3A_Server_Ring_lw, 1000007 Internet_lw, 1000008 Frame_pipe_weekly_lw, 1000009 GC_LAN_Traffic_lw, 1000010 standard, 1000011 Cent_Sw_Report_lw, 1000012 standard, 1000013 standard, 1000014 standard, 1000015 3A_Server_Ring_lw, 1000016 7AM-7PM, 1000017 Listing all service profiles with stored data in health tables config_id, number of rows Ran cleanDacs, removed all the profiles except the following: Listing all service profiles known in database name, config_id standard, 1000011 7AM-7PM, 1000017 Listing all service profiles with stored data in health tables config_id, number of rows Tried to run DB save again, this time it fails with a different config_id: Unload the dac tables. . . 
Unloading table nh_daily_exceptions_1000011 .... Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions_1000011 () INTO '/health/db/save/Sept9-18.tdb/nh_daily_exceptions_1000011'' (E_US0845 Table 'nh_daily_exceptions_1000011' does not exist or is not owned by you. (Tue Sep 18 15:56:01 2001) ). (cdb/DuTable::saveTable) As you can see, from the listDacs results, this ID corresponds to the only remaining "standard" service profile. Data Analysis is also still failing. All log files are on BAFS, under ticket #53589. The most up-to-date files are under 9.19.01 9/19/2001 10:09:39 AM yzhang Colin, please collect nhCollectCustData, and also send nhReset. Thanks Yulun 9/19/2001 10:41:08 AM cestep Requested the additional information. Changing ticket to 'moreinfo'. 9/19/2001 11:42:48 AM cestep Received DbCollect.tar and nhReset. They are on BAFS, ticket #53589/9.19.01 9/19/2001 11:43:04 AM cestep -----Original Message----- From: cspradlin@purina.com [mailto:cspradlin@purina.com] Sent: Wednesday, September 19, 2001 11:21 AM To: support@concord.com Subject: ticket 53589 FYI. This error showed up when I ran the nhCollectCustData command. > nhCollectCustData nethealth > > If Ingres is not running, then no ingres related information can > be > collected, but the system and operating messages are still > collected in > /appl01/health/tmp/oslogs.tar > > If ingres is running, all collected information is stored in > /appl01/health/tmp/DbCollect.tar > > > > SQL Error: > E_LQ0059 Unable to start up 'fetch csr' command. > Unexpected initial protocol response. > 9/19/2001 3:29:04 PM yzhang Colin, Find out what successful dbsave they have. Also send me the following: login as nhuser source net*.csh echo "select * from iifile_info where table_name = 'nh_daily_exceptions_1000011'\g" | sql $NH_RDBMS_NAME login as ingres source nethealthrc.csh optimizedb $NH_RDBMS_NAME > optimize.out sysmod $NH_RDBMS_NAME > sysmod.out Thanks Yulun 9/24/2001 11:26:01 AM cestep Received info. 
On BAFS, under ticket # 53589/9.24.01 9/24/2001 1:04:56 PM cestep Changing back to assigned. 9/25/2001 10:06:27 AM yzhang Robin, This is a problem regarding dbsave and data analysis failures. The dbsave fails because the expected split dac table, such as nh_daily_exceptions_1000001, does not exist, but the config_id is in nh_rpt_config. I thought this table should have been created by db conversion. The other problem: assume I create a new empty 4.8 db, then insert some rows into the nh_rpt_config table, and immediately afterward do a dbsave; in this case I will get the 'split dac table does not exist' error. I can have the customer create the split tables and populate the data, and then savedb will succeed, but this is a short-term solution. I think you worked with the same kind of problem before. Thanks Yulun 9/25/2001 11:41:39 AM yzhang Colin, please get the script called create_index_split_dac.sh from ~yzhang/scripts, and send it to the customer. Run the script using the following command; they need to run it as nhuser and after sourcing. create_index_split_dac.sh > create_index_split_dac.out and send me the create_index_split_dac.out. I looked at their previous load.log, which failed on the same table. This is why the save failed. 9/25/2001 12:33:44 PM bhinkel Based on above update from Yulun, set to MoreInfo. 9/25/2001 1:10:31 PM cestep Received file. Placed on BAFS under ticket #53589. Changing back to assigned. 9/25/2001 1:27:17 PM yzhang Ask him to do a dbsave now, then have him send you the save.log. 9/26/2001 7:46:46 AM cestep From save.log: Begin processing (09/25/2001 01:27:12 PM). Copying relevant files (09/25/2001 01:27:12 PM). Unloading the data into the files, in directory: '/health/db/save/./Sept25.tdb/'. . . Unloading table nh_active_alarm_history . . . Unloading table nh_active_exc_history . . . Unloading table nh_alarm_history . . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_alarm_subject_history . . . Unloading table nh_bsln_info . . . 
Unloading table nh_bsln . . . Unloading table nh_calendar . . . Unloading table nh_calendar_range . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table nh_exc_subject_history . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_exc_history . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_le_global_pref . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . . Unloading table nh_system_log . . . Unloading table nh_step . . . Unloading table nh_schema_version . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_protocol . . . Unloading table nh_protocol_type . . . Unloading table nh_rpt_config . . . Unloading table nh_rlp_plan . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_analysis . . . Unloading table nh_subject . . . Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . 
Unloading table nh_var_units . . . Unloading the sample data . . . Unloading the latest sample data definition info . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_bsln_info . . . Unload the dac tables. . . Unloading table nh_daily_exceptions_1000011 .... Unloading table nh_daily_symbol_1000011 .... Unloading table nh_daily_health_1000011 .... Unloading table nh_hourly_health_1000011 .... Unloading table nh_hourly_volume_1000011 .... Unloading table nh_daily_exceptions_1000017 .... Unloading table nh_daily_symbol_1000017 .... Unloading table nh_daily_health_1000017 .... Unloading table nh_hourly_health_1000017 .... Unloading table nh_hourly_volume_1000017 .... Unload of database 'nethealth' for user 'health' completed successfully. End processing (09/25/2001 02:42:15 PM). ------------------------------------------------------------------------------------------------------------------------------------------------------ However, Chuck says that the conversations rollup and Data Analysis are still failing. Updated Data Analysis log, Rollup log and errlog.log on BAFS, under 53589/9.26.01 9/26/2001 12:50:02 PM yzhang Colin, the customer's db has been corrupted; he needs to do a destroy, create and reload (load the db he saved yesterday). After loading, send me the following: 1) echo "select table_name,num_rows,create_date,storage_structure from iitables order by table_name\g" | sql nethealth > table.out 2) check the disk partition for II_WORK ($NH_HOME/idb), and send df -k . > II_WORK.out 3) find out the transaction log size. You probably need to work with him over the phone on those. Thanks Yulun 9/27/2001 11:34:45 AM yzhang Get the script named clearElemAddrDup_12022.sh from ~yzhang/scripts, stop nhserver, then run the script as nhuser after doing the source, and send me the clearElemAddrDup.out file from $NH_HOME/tmp. 
This script will remove all of the duplicates that appeared in load.log Thanks Yulun 9/27/2001 12:44:19 PM yzhang I think the duplicates have been taken care of; now run data analysis and the conversations rollup manually, and send us the output. 9/27/2001 4:05:25 PM cestep Data Analysis is successful. Need a new ticket for the conversations rollup. 9/27/2001 4:16:47 PM yzhang Let's take care of the conversations rollup failure; please do the following: echo "drop table nh_dlg0_1000443599\g" | sql $NH_RDBMS_NAME echo " delete from nh_rlp_boundary where max_range=1000443599 and rlp_stage_nmbr=0 and rlp_type = 'SD'\g" | sql $NH_RDBMS_NAME After these, run the conversations rollup manually. Thanks Yulun 9/27/2001 4:21:14 PM cestep Associating call ticket #54633 for the rollup failure issue. 9/27/2001 4:32:23 PM yzhang I think the warning is due to the group and grouplist information getting lost when reloading the database; do the comparison to verify this is the case. Thanks Yulun 9/30/2001 12:36:46 PM yzhang Don, This customer's dbsave, data analysis and conversations rollup seem to be running, and the ticket can be de-escalated. Colin, Can you work with the customer on comparing the number and type of elements that appear in *.grp with those shown from the console. I sent you instructions previously. This is for taking care of the warning message. Thanks Yulun 10/4/2001 7:37:34 AM cestep Customer's Data Analysis and Rollups are now successful. The issue has been resolved. 10/4/2001 9:41:02 AM yzhang Colin, Thanks very much for taking care of the warning message; can you tell me why the warning was there and how you got rid of it. Let's close this ticket. Yulun 9/20/2001 8:05:41 PM klandry Summary: - Customer is seeing r-norm as the speed for newly discovered elements. - This has been happening for approx 5 months. - nhDbStatus shows 840,000 nodes, - The problem appears to be that the nxt_hdl has wrapped into the 2 million range (server_id=1). 
- Thus all newly discovered elements are considered "remote" with r-norm for the poll speed. - They are running TA and Stats off of the same Sol server, ver 4.8p3. Files on BAFS: errlog.log.txt -----Original Message----- From: Burke, Walter Sent: Thursday, September 20, 2001 6:53 PM To: Landry, Keith Subject: Bug: - Spoke w/ Jay. - There should be some other way of determining whether or not an element is remotely polled. - Currently we use only the element_id range, as determined by server_id, to differentiate between central and remote elements. - As customers reach the wrap point of the nxt_hdl range, newly discovered elements are assigned the next logical id. If this id is in a new range, i.e. from 1,999,999,999 to 2,000,000+, the element will be considered by the product as remotely polled, even though it is not. 9/21/2001 9:28:39 AM rtrei Yulun-- I think we just need to reset something. But we should discuss what! :> 10/9/2001 11:23:05 AM yzhang Walter, We need to give more attention to this customer; they not only have nxt_hdl over 2000000, but currently they have the following possible problems: 1) possibly out of disk space (they only have about 3G free, and db size is more than 5G) 2) possibly a table is hitting 2 GB 3) errlog.log shows SCF alerts at almost every midnight. 
Please check and collect the following: 1) list of the physical file sizes 2) ask them to increase the disk space; 5 to 8 G will be fine 3) increase the transaction log to 2G 4) find out what they are running around each midnight (e.g., the scheduled jobs) 5) echo "copy table nh_element() into 'nh_element.dat'\g" | sql $NH_RDBMS_NAME 6) echo "select * from hdl\g" | sql $NH_RDBMS_NAME > hdl.out Thanks Yulun 10/9/2001 11:34:45 AM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 09, 2001 11:18 AM To: 'Cynthia.badgett@hqasc.army.mil' Subject: 54302 - Rnorm objects Cynthia, Currently we may have the following possible problems: 1) possibly out of disk space (they only have about 3G free, and db size is more than 5G) 2) possibly a table is hitting 2 GB 3) errlog.log shows SCF alerts at almost every midnight. Please check and collect the following: 1) list of the physical file sizes - ls -l of $NH_HOME/idb/ingres/data/default/nethealth - redirect into a file and send. 2) Increase the disk space from 5 to 8 G. - Ask your sysAdmin if this is possible. 3) increase the transaction log to 2G - Increase space 1st. - nhServer stop - nhResizeIngresLog 1950 - nhServer start 4) find out what they are running around each midnight (e.g., the scheduled jobs). - nhSchedule -list -full > list.out - send list.out 5) echo "copy table nh_element() into 'nh_element.dat'\g" | sql nethealth 6) echo "select * from hdl\g" | sql nethealth > hdl.out Sincerely, 10/11/2001 1:45:15 PM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, October 11, 2001 1:28 PM To: 'Cynthia.badgett@hqasc.army.mil' Subject: 54302 Cynthia, I am awaiting the following: 1) list of the physical file sizes - ls -l of $NH_HOME/idb/ingres/data/default/nethealth - redirect into a file and send. 2) Increase the disk space from 5 to 8 G. - Ask your sysAdmin if this is possible. 3) increase the transaction log to 2G - Increase space 1st. 
- nhServer stop - nhResizeIngresLog 1950 - nhServer start 4) find out what they are running around each midnight (e.g., the scheduled jobs). - nhSchedule -list -full > list.out - send list.out 5) echo "copy table nh_element() into 'nh_element.dat'\g" | sql nethealth 6) echo "select * from hdl\g" | sql nethealth > hdl.out Sincerely, 10/12/2001 3:03:37 PM yzhang Thanks for the info. I think I still need nh_element.dat; see if you can get it today: echo "copy table nh_element() into 'nh_element.dat'\g" | sql nethealth 10/12/2001 3:51:18 PM yzhang I want to see the actual element_id; ask to see if we can have the following: echo "select element_id, name from nh_element order by element_id\g" | sql $NH_RDBMS_NAME > element_id.out 10/16/2001 5:06:36 PM wburke Obtained all. BAFS/54302 11/2/2001 2:49:16 PM yzhang Robin, This is a problem Walter talked to Jay about. The problem is that if element_id is over 2000000, the element will be considered to be an element polled remotely (even though it is not), so the element info is not displayed as it should be for a 1,000,000-range element. You mentioned we need to reset something; do you have an idea about what we are going to reset? My guess is that if we reset the NH_SERVER_ID to 10, instead of 1 (the default value), then the first 10,000,000 elements will be considered to be polled locally, and this will make the element info display properly. (This might be a bad idea.) This is what Walter discussed with Jay: - There should be some other way of determining whether or not an element is remotely polled. - Currently we use only the element_id range, as determined by server_id, to differentiate between central and remote elements. - As customers reach the wrap point of the nxt_hdl range, newly discovered elements are assigned the next logical id. If this id is in a new range, i.e. from 1,999,999,999 to 2,000,000+, the element will be considered by the product as remotely polled, even though it is not. 
Thanks Yulun 11/6/2001 1:01:53 PM jay The problem is a result of a hack in /wsApps/pollerUi/PlrCfgUiData.C; in that code there is a function PlrCfgUiData::buildUiString () that states: if ((elementId > 1000000 && elementId < 1000000 * nhServerId + 1) || elementId > 1000000 * nhServerId + 1000000) { pollRateString = "R-"; } Another check should be made before this if, that is: if (pElement->getIpAddress () == "0.0.0.0") ... This will only apply the site rule to those elements that look like they came from a remote poller. A better, more expensive approach is to tag the remote elements once they come from the central site with a new column (or overloaded column) that shows that it came from a remote poller. 11/6/2001 1:02:14 PM jay I recommend that the UI group address this Poller Config UI issue. 11/16/2001 11:50:01 AM yzhang Will do code change in PlrCfgUiData.C under framework/poller Int nhServerId = CuSysInfo::getIntVal (WscSitNhServerId); Int elementId = pElement->getDbId (); if ((elementId > 1000000 && elementId < 1000000 * nhServerId + 1) || elementId > 1000000 * nhServerId + 1000000) { pollRateString = "R-"; } 11/16/2001 11:50:50 AM yzhang in wsApps/PollerUi 12/31/2001 10:47:05 AM mwickham -----Original Message----- From: Wickham, Mark Sent: Monday, December 31, 2001 10:35 AM To: Zhang, Yulun Subject: Problem Ticket 18012 (56639) Hi Yulun, Can I have a status on this problem ticket, please? Thanks - Mark 5/6/2002 5:13:15 PM rtrei In 5.0 we pulled the TA nodes from the nh_element table. That should prevent this situation from happening from 5.0 on. I am not sure we have a good solution for 4.8 other than manually handling it. I am marking this declined. If tech support disagrees with me, please raise the issue to Rich Hawkes or Robin Trei 5/10/2002 2:17:35 PM mwickham We sent this message to all customers associated with this bug (that in 5.0 and beyond this should not be an issue.) 
One of our customers, call ticket 54302, promptly replied stating they were still experiencing this on 5.0.1 D02 P01. What sort of evidence do you want from the customer? Thanks - Mark 5/10/2002 2:30:09 PM mwickham -----Original Message----- From: Trei, Robin Sent: Friday, May 10, 2002 02:25 PM To: Wickham, Mark Subject: RE: Problem Ticket 18012 Oops, I meant newly created databases (after 5.0.2) But get a copy of the hdl table, and select count(*) from nh_element_core select min(element_id), max(element_id) from nh_element_core as well Do the same for nh_node. Let me know if you need help translating the above into correct sql syntax 5/10/2002 5:33:03 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Friday, May 10, 2002 05:33 PM To: Trei, Robin Subject: Problem Ticket 18012 Robin, We sent the attached script to the customer who's running 5.0.1 and still experiencing the problem. Below, she provides the output. Thank you, Mark <18012.out email attached> 5/10/2002 5:59:28 PM mwickham -----Original Message----- From: Trei, Robin Sent: Friday, May 10, 2002 05:49 PM To: Wickham, Mark Subject: RE: Problem Ticket 18012 OK, thanks for bringing this to my attention. Any databases created after 5.0 will not have this problem, and any customers that do not already have this problem will not develop this problem, but customers that already had this problem were not helped with the fix. Please reset the declined field on this ticket and we will re-evaluate this ticket and determine if there is anything we can do for customers who already have this problem. 5/21/2002 12:34:47 PM rhawkes Setting to declined. 5/29/2002 2:35:07 PM mwickham Customer upgraded from 4.8 to 5.0.1 D02 P01 and continues to see this problem. Per Robin's message above, this is expected...we need to fix this since at least three other customers will experience this as they move from 4.8 to 5.0.x. 
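The range check at the heart of this ticket (quoted earlier from PlrCfgUiData::buildUiString()) can be tried stand-alone to see why IDs past the wrap point show up as R-. This is an illustrative sketch, not product code; the `classify` function name is invented, and the arithmetic mirrors the quoted C condition term for term.

```shell
#!/bin/sh
# Stand-alone sketch of the central-vs-remote range check from
# PlrCfgUiData::buildUiString().  The function name is illustrative;
# the cutoffs (a block of 1,000,000 IDs per server_id) come from the ticket.

# classify ELEMENT_ID NH_SERVER_ID  ->  prints "central" or "remote"
classify() {
    id=$1
    sid=$2
    # remote if the ID falls below this server's block of IDs ...
    if [ "$id" -gt 1000000 ] && [ "$id" -lt $((1000000 * sid + 1)) ]; then
        echo remote
    # ... or above it (e.g. past the 2,000,000 wrap point for server_id=1)
    elif [ "$id" -gt $((1000000 * sid + 1000000)) ]; then
        echo remote
    else
        echo central
    fi
}

classify 1500000 1   # a locally polled element on server 1
classify 2000001 1   # past the wrap point: the UI shows the "R-" prefix
```

With server_id=1 the first clause can never match, so the only way to be tagged remote is an ID above 2,000,000 -- exactly the wrap symptom described in the ticket.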
6/6/2002 3:07:44 PM mmcnally I have attached Call ticket 64989 to this as Kwok is on 5.02 migrated from 4.8 and is seeing R-norm elements . 6/6/2002 4:51:18 PM mwickham >-----Original Message----- >From: Wickham, Mark >Sent: Thursday, June 06, 2002 4:24 PM >To: Karten, Eric >Subject: Problem Ticket 18012 "Seeing r-norm for newly discovered >elements" > >Eric, > >We informed one of our customers (ICS) that this ticket was declined >taking the following verbage from the problem ticket: "In 5.0 we pulled >the TA nodes from the nh_element table. That should prevent this >situation from happening from >5.0 on." > >They replied and would like you to evaluate this statement and let them >know if it is correct or not: > >Yes, the TA nodes are now stored in the nh_node table and not in the >nh_element_xxxx anymore, but they still use the element_id in both >tables. > >So, if the element_id counter is increased by the nodes in TA, it is >still increased for the elements in the element tables too. >So, we believe this problem still exists in 5.0. > >Can you let me know if this is accurate or not, please? > >Thanks - Mark -----Original Message----- From: Karten, Eric To: Wolf, Jay Sent: 6/6/2002 4:28 PM Subject: FW: Problem Ticket 18012 "Seeing r-norm for newly discovered elements" I think you may be able to answer this. I can't. E -----Original Message----- From: Wolf, Jay Sent: Thursday, June 06, 2002 04:46 PM To: Karten, Eric Cc: Wickham, Mark Subject: RE: Problem Ticket 18012 "Seeing r-norm for newly discovered elements" Eric, This is accurate. We should split out the node IDs from the stats element IDs going forward. Especially considering the new remote poller will still partition the element ID ranges. I think the change is: 1) CdbPlrNodePairs::assignAddr () { Call to _elementTbl.getUniqueId () should assign from a different ID pool. 2) Db convert, create a hdl table entry for the new ID pool. 
Jay 6/6/2002 4:52:09 PM mwickham Changing to Assigned based on the email thread above and Jay's resulting comments. 6/13/2002 10:21:56 AM rhawkes Per this morning's escalated tickets meeting, Support will find out if the customer impact of this ticket is limited to visual impact in the UI. If so it may be possible to wait for a fix to the problem which Rob. L indicated will be fixed in v5.6. 6/13/2002 11:22:19 AM mmcnally I just spoke to Kwok Lee (64989) about this and he said this is indeed just a visual impact in the UI. He would like to see it fixed in 5.0 but as this is not a critical issue that is your call. Thanks, Mike 6/13/2002 2:13:29 PM apier De-escalated. We will explain that it cannot be fixed in 5.0.2 8/2/2002 11:33:41 AM don Changing to assigned, I don't see a request for info here. 8/7/2002 9:03:06 AM dbrooks see notes. 9/21/2001 2:52:31 PM rkeville Install of nethealth and creation of database hangs at an SQL> prompt - When running the NH install it appears to hang at an SQL> prompt. - Ran out of disk space on the partition they were installing on. - Need to add disk space checking to the install script to verify there is enough room to create the database in the partition specified. - When uninstall_ora -d was run, the database and nethealth install were successfully removed, but the instance was left running, causing the second install to fail. - The instance needs to be shut down by the uninstall_ora script to prevent running the install again on a running instance. - What happens if the DBA changes the password for the sys and internal users? If someone attempts to run the INSTALL.NH script again after that, which has change_on_install hard coded into it, then the install will fail. Troubleshooting details are in the call ticket associated with this problem ticket. 
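The install hang above came down to running out of disk space mid-create. A pre-flight check of the kind the notes ask for ("add disk space checking to the install script") could look like the sketch below. The helper name and the example size are assumptions for illustration; a real install script would take the required size from its own sizing calculation.

```shell
#!/bin/sh
# Sketch of a pre-install free-space check.  check_space and the example
# 2 GB figure are illustrative, not part of the actual installer.

# check_space DIR REQUIRED_KB  ->  returns 0 if DIR's filesystem has room
check_space() {
    dir=$1
    need_kb=$2
    # POSIX `df -P` guarantees one data line per filesystem; column 4 is
    # available space in the units given by -k (kilobytes)
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$avail_kb" -lt "$need_kb" ]; then
        echo "ERROR: $dir has ${avail_kb} KB free, need ${need_kb} KB" >&2
        return 1
    fi
}

# e.g. refuse to create the database unless ~2 GB is free (path from ticket):
# check_space /export/platinum1/bd54.5/oradata 2097152 || exit 1
```

Failing fast here is cheaper than letting database creation die partway and leaving a control file and a running instance behind, which is exactly the cleanup problem the uninstall_ora notes describe.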
######################################################## 9/24/2001 11:37:43 AM wzingher Dup of 12865 9/25/2001 3:13:18 PM rrick Error: nhiSaveDb.exe: Fatal Internal Error: Unable to execute "COPY TABLE nh_job_step() INTO 'E:/nethealthdb.tdb/njs_b47" E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW 9/25/2001 3:15:00 PM rrick All files on bafs/escalated tickets/54000/54306. 9/28/2001 2:32:22 PM rrick We got by this. I am closing it. 9/25/2001 4:30:49 PM mgenest Database stops intermittently. Customer reports that the database dies every few days without warning. Stack dump and lots of "Internal DMF errors detected" error messages in the errlog.log. His ingres_log is 1048576000. Polling 3700 elements. The only related tickets seem to be for an Ingres bug. Looks like the same issue as in problem ticket # 15709. Wanted to send the customer this solution. Talked to Bob Keville and he did not understand which parameters we were supposed to set. Said to contact Yulun Zhang. Contacted Yulun, and he said to write a problem ticket and send it to the database group. Results of nhCollectCustData are on BAFS/54000/54363 9/25/2001 5:32:10 PM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, September 25, 2001 5:16 PM To: 'matt.overbay@nwdc.ibs-lmco.com' Subject: Ticket # 54363 Matt, I have been assigned this call ticket. Issue: Ingres dies intermittently due to large amounts of internal errors. Status: At this time I have logged a problem ticket with the DbTeam. We are reviewing the problem, and will be in contact. Sincerely, 10/4/2001 6:06:12 PM yzhang The customer's db was corrupted about two weeks ago. I already have them running the savedb; if the save succeeded, they need to destroy, create and reload. Can you call him tomorrow and help him with the cycle? 
Yulun 10/5/2001 11:27:15 AM wburke -----Original Message----- From: Burke, Walter Sent: Friday, October 05, 2001 11:11 AM To: 'matt.overbay@nwdc.ibs-lmco.com' Subject: Ticket # 54363 - Db stops intermittently Hi Matt, My understanding is that the database was to be saved, destroyed, created and re-loaded. 1. Was the save successful? If not, forward the $NH_HOME/log/save.log 2. If so, do you need assistance with the next steps? Sincerely, 10/5/2001 12:38:15 PM wburke -----Original Message----- From: Overbay, Matt [mailto:Matt.Overbay@nwdc.ibs-lmco.com] Sent: Friday, October 05, 2001 11:29 AM To: 'Burke, Walter' Subject: RE: Ticket # 54363 - Db stops intermittently Walter, I am currently in the process of saving the database. It should be done in the next 2 hours. I do need assistance in rebuilding the database and removing the corruption. Thanks Matt Overbay. 10/9/2001 3:24:18 PM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 09, 2001 3:07 PM To: 'matt.overbay@nwdc.ibs-lmco.com' Subject: Ticket # 54363 - Db Hangs Hi Matt, How did the application run over the weekend? Do you see any problems since the re-load? Thanks, 10/9/2001 3:49:09 PM wburke -----Original Message----- From: Overbay, Matt [mailto:Matt.Overbay@nwdc.ibs-lmco.com] Sent: Tuesday, October 09, 2001 3:40 PM To: 'Burke, Walter' Subject: RE: Ticket # 54363 - Db Hangs I have not seen any problems as yet. Matt O- LMCO 10/9/2001 3:51:35 PM yzhang Destroy, create and reload of the db solved the problem 9/26/2001 11:13:07 AM rkeville Database marked inconsistent after customer runs nhDbStatus from GUI. - A core file is generated by nhiDbServer. - Deadlock on iirelation; the recovery server is unable to perform recovery of the nethealth database. - Database marked inconsistent after inability to recover or create a table control block. 
Requested the following from the customer: - $II_SYSTEM/ingres/files/iircp.log - $II_SYSTEM/ingres/files/iiacp.log - $II_SYSTEM/ingres/files/errlog.log - System messages log from the nethealth console. - Database dropdown -> Save system log as - $NH_HOME/nethealthrc.csh, nethealthrc.sh, nethealthrc.csh.usr and nethealthrc.sh.usr files. - verifydb -mreport -sdbname 'nethealth' -odbms_catalogs - verifydb -mreport -sdbname 'iidbdb' -odbms_catalogs - $II_SYSTEM/ingres/files/iivdb.log file. - sysmod nethealth - Send the output from the screen text after running this command. Core file and original errlog.log file located on BAFS under ticket 54440. This is a potential revenue impact issue; the customer has spent a great deal of money on a new system to run TA on to ensure this would not break again, and their perception is that we broke it again. There is another deal with them on the line right now and Sue Ramse wants this addressed by end of the quarter to make the sale. ############################################################## 9/26/2001 4:05:59 PM rkeville Output files on BAFS ####################################################### 9/26/2001 5:34:20 PM yzhang Requested nhCollectCustData 9/27/2001 10:41:20 AM yzhang Did you have the customer collect the nhCollectCustData? 9/27/2001 11:06:27 AM rkeville It's on BAFS under 54440/9-27-2001 -Bob ############################################################ 9/27/2001 2:53:37 PM yzhang The customer had an Ingres crash on Sept 7, 2001, at 4:39 PM. The crash was due to corruption of iirelation and a control block, which further caused deadlock and database inconsistency. Support here did a destroy, create and reload right after the crash; now the customer is up and running, and the nethealth database and the iidbdb master database are consistent. Bob, what I mentioned here is basically what you said in the description. I recommended the customer run nhDbStatus from the command line in the future, and make sure they have the database save scheduled. 
I explained the problem to the customer, and he said this can be closed. Don, can you close this one? Thanks Yulun 9/28/2001 1:24:58 PM rrick Problem: The customer reported that every time they reboot the eHealth server, the NT event log appends a new error log message from Ingres: The description for Event ID ( 2003 ) in Source ( Ingres ) could not be found. It contains the following insertion string(s): This product/program is licensed to: CONCORD COMMUNICATIONS INC Site ID: 0167267. nhiIndexStats.exe: Internal Error: Unable to connect to database 'nethealth' (E_LQ0001 Failed to connect to DBMS session. E_LC0001 GCA protocol service (GCA_REQUEST) failure. Internal service status E_GC0139 -- No DBMS servers (for the specified database) are running in the target installation.. ). (du/DuDatabase::dbConnect) The description for Event ID ( 1006 ) in Source ( Network Health ) could not be found. It contains the following insertion string(s): nhiStdReport.exe: Fatal Error: Assertion for 'cuError.isOk()' failed, exiting (in file ./CdbTblParent.C, line 175). Customer claims that there is no significant database issue. The eHealth data poller and report generation are working. All files located on \bafs\53000\53369. 10/1/2001 10:27:35 AM yzhang Do you know how to read the NT application event log? 10/1/2001 11:19:39 AM rkeville Start -> Programs -> Administrative Tools -> Event Viewer. Log Menu -> Open -> Navigate to file. ############################################################## 10/16/2001 3:55:28 PM yzhang I cannot open the application event file in the escalated directory; if you can open it, you can compare whether what's in the event log also appears in the error log, and how serious this error is. 
Thanks Yulun 10/17/2001 10:08:29 AM rkeville -----Original Message----- From: Keville, Bob Sent: Tuesday, October 16, 2001 4:35 PM To: Zhang, Yulun Subject: RE: ProbT0000018152 Yulun, The Application log contains the following error for Ingres: The description for Event ID ( 2003 ) in Source ( Ingres ) could not be found. It contains the following insertion string(s): This product/program is licensed to: CONCORD COMMUNICATIONS INC Site ID: 0167267. The System event log contains nothing for Ingres. The new Application log contains the same error message for Ingres. The only error message in the errlog.log file is "E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association". It looks like the message in the Application log happens when the Name Server tries to start... -Bob 11/1/2001 3:20:26 PM yzhang This ticket is getting more important because there is some money on the line. This is on NT 4.8 patch 4. The problem is that the Application log contains the following error for Ingres (besides this, everything is running fine as Bob mentioned; errlog.log looks fine): The description for Event ID ( 2003 ) in Source ( Ingres ) could not be found. It contains the following insertion string(s): This product/program is licensed to: CONCORD COMMUNICATIONS INC Site ID: 0167267. Do you know what this error means? I guess it might be related to the CA license. Thanks Yulun 11/1/2001 4:18:14 PM yzhang Bob, I actually checked with Robin regarding the message appearing in the application log. She said it isn't an error message at all; it is informational in nature. It's just saying that the db is licensed to Concord for its product eHealth. So tell them not to worry about it; if this is the only problem they have, we might want to close the ticket. 
Yulun 11/2/2001 5:28:47 PM yzhang Closed, it is not an error message 9/28/2001 2:36:12 PM rkeville The install failed; there was a problem creating the redo.log file in the /export/platinum1/bd54.5/oradata/NHTD directory during the creation of the nethealth database. - Changed permissions on the /export/platinum1/bd54.5/oradata/NHTD dir to 777. - Permissions were rwx for Oracle and rx for everybody else. - Reran the db creation, which failed due to an existing control file. - Deleted the control file and ran the creation successfully. ############################################################ 10/10/2001 9:29:23 AM sorr Can't reproduce this. 9/28/2001 4:13:55 PM rrick Problem: Customer did an init -6 to take down the server. I was going to have them delete, create and re-load from a good save. Died on nhDestroyDb. /nethealth/idb/ingres/files > nhDestroyDb nethealth nhDestroyDb requires INGRES to be running and accessible, but it doesn't seem to be. Please correct the following error returned by INGRES: E_LC0001 GCA protocol service (GCA_REQUEST) failure. Internal service status E_GC0139 -- No DBMS servers (for the specified database) are running in the target installation.. E_LQ0001 Failed to connect to DBMS session. -------------------------------------------------------------------------------------------------------------------------------------- Cannot Resize Ingres Transaction Log: ------------------------------------------------------------------------------------------------------------------------------------- Ingres/ingstart: Checking host "ftbca1" for system resources required to run Ingres... Unable to open kernel memory file /dev/kmem All shared memory resource checking has been disabled. Your system has sufficient resources to run Ingres. Starting your Ingres installation... Starting the name server... Allocating shared memory for logging and locking systems... Starting the recovery server...FAIL iirundbms: server would not start. 
II_SYSTEM must be set in your environment. Has the csinstall program been run? II_DATABASE, II_CHECKPOINT, II_JOURNAL and II_DUMP must also be set. See II_CONFIG/symbol.tbl. Check the file '/nethealth/idb/ingres/files/errlog.log' for more details concerning internal errors. See your Installation and Operations Guide for more information concerning server startup. The recovery server failed to start. --------------------------------------------------------------------------------------------------------------------------------- Files on Bafs/54000/54436. 10/1/2001 10:46:50 AM yzhang Have the customer try ingstop, ipcs, ipcclean, csinstall, ingstart. If it still does not work, try to start each ingres process separately. I think you did the same thing before. Let me know if you have any questions 10/2/2001 12:45:11 PM wburke Destroyed, recreated, loaded. Everything looks OK. 10/3/2001 1:17:18 PM wburke NO bug, OK to close. 10/4/2001 10:21:54 AM wburke Almond called, OK to close. 10/4/2001 10:24:09 AM yzhang problem no longer exists 10/2/2001 8:50:14 AM foconnor Customer is getting the error: E_CO003F COPY: Warning: 4 rows not copied because duplicate key detected. in their fetch logs. This is a repeat of the problem resolved in Problem ticket 14040. Because of another database issue an older database had to be loaded and the problem is back. Central ingres logs: //BAFS/escalated tickets/53000/53002/Sept28 Fetch log //BAFS/escalated tickets/53000/53002 - Customer system configuration: - Central machine: - OS: Windows NT4 SP5 - NH 4.7.1, P02 and D04 - NH=SERVER_ID=1 - UK Remote machine: - OS: Windows NT4 SP5 - NH 4.7.1, P02 and D04 - NH=SERVER_ID=2 - HP Remote machine - OS: HP-UX B.10.20 - NH 4.7.1, P02 and D04 - NH=SERVER_ID=3 - During the merge portion of the fetch the following errors occur: Adding element data from files lat_b41.ascii/nea_b23.ascii/els_b45.ascii/mtf_b45.ascii ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. 
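The restart sequence suggested in the entry above (ingstop, ipcs, ipcclean, csinstall, ingstart) can be collected into a small wrapper. A minimal sketch, assuming the Ingres binaries are on PATH; the exact behavior and availability of these commands depend on the Ingres release.

```shell
# Hedged sketch of the restart sequence from the ticket: stop Ingres,
# inspect and clean up stale shared-memory segments, reinstall the
# shared-memory control segment, then start Ingres again.
# Command names are the ones quoted in the log; flags vary by release.
restart_ingres() {
    ingstop             # stop all Ingres processes
    ipcs                # list IPC resources left behind
    ipcclean            # remove stale Ingres IPC resources
    csinstall           # recreate the control/shared-memory segment
    ingstart            # start the installation again
}
```

If ingstart still fails after this, the next step in the ticket was to start each ingres process separately and check errlog.log.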
(0 rows) E_CO003F COPY: Warning: 4431 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 1477 rows successfully copied. (1477 rows) E_CO003F COPY: Warning: 7584 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 2528 rows successfully copied. (2528 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (82 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (184 rows) Adding element data from files lat_b41/nea_b23/els_b45/mtf_b45 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (0 rows) E_CO003F COPY: Warning: 1296 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 432 rows successfully copied. (432 rows) E_CO003F COPY: Warning: 2541 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 847 rows successfully copied. 10/2/2001 11:03:03 AM yzhang Farrell, Can you collect the following: echo "select table_name, num_rows, storage_structure from iitables order by table_name\g" | sql $NH_RDBMS_NAME > table.out Thanks Yulun 10/9/2001 9:45:20 AM foconnor Yulun called me and says that this message is harmless. 10/9/2001 9:47:08 AM yzhang requested support to tell the customer that the warning message is harmless, but I need to figure out why there are some duplicates in the element-related tables 10/10/2001 7:13:52 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Wednesday, October 10, 2001 6:57 AM To: 'support@ipperformance.com.au' Cc: O'Connor, Farrell Subject: Call ticket 53002 Shane, We have examined the Warning messages that your customer has been seeing in the Fetch logs and these errors were found to be benign, but we are still researching why the warnings are being issued. 
We are going to keep this issue open until we can resolve the reason for the warning messages. From the fetch logs: E_CO003F COPY: Warning: 1296 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 432 rows successfully copied. (432 rows) E_CO003F COPY: Warning: 2541 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 847 rows successfully copied. 11/2/2001 10:31:32 AM yzhang Can you get the following for me: 1) run a fetch in debug mode using sh -x nhFetchDb >& fetch.out (this is the command for Unix), and send the output 2) from the central site copy nh_elem_assoc, nh_elem_latency, nh_mtf, and nh_elem_alis into a file, and send the file (example: copy table nh_elem_assoc() into 'nh_elem_assoc.dat' \g) 3) send the files (els_b45,mtf_b45,nea_b23,lat_b41,tds_b30) under $NH_HOME/db/remotePoller/nethealth/remotehost/Remote.tdb... Thanks Yulun 11/7/2001 12:51:07 PM foconnor Received information from customer. //BAFS/escalated tickets/53000/53002/Nov06..... 11/15/2001 6:33:21 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Thursday, November 15, 2001 6:24 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 18205; Call ticket 53002 Importance: High Yulun, Can I get an update on the status of problem ticket 18205? Regards, Farrell 11/19/2001 7:44:13 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Monday, November 19, 2001 7:35 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 18205 Duplicate keys on fetches Yulun, Can I get an update on the status of problem ticket 18205? 
11/19/2001 2:31:35 PM yzhang Farrell, The following is the important part of the fetch debug output, which will explain why there are duplicates in the nh_elem_* related tables when copying the file fetched from the remote to the central: for example, for the nh_elem_assoc table (for one remote site): on the remote, nhFetchDb first does a delete * from nh_elem_assoc where element_id between 3000003 and 3003809, then the duplicate message appears when copying the file fetched from the remote to the central. This means there are element_ids in the 1000000 range from both the central and remote sites. Here is what to do: 1) find out if there are any element_ids from both central and remote in the 1000000 range in the nh_elem_* tables. If so, they need to run a delete query to remove all of the 1000000-range element_ids. 11/20/2001 6:18:56 AM foconnor -----Original Message----- From: Zhang, Yulun Sent: Monday, November 19, 2001 2:22 PM To: O'Connor, Farrell Subject: RE: Problem ticket 18205 Duplicate keys on fetches Farrell, The following is the important part of the fetch debug output, which will explain why there are duplicates in the nh_elem_* related tables when copying the file fetched from the remote to the central: for example, for the nh_elem_assoc table (for one remote site): on the remote, nhFetchDb first does a delete * from nh_elem_assoc where element_id between 3000003 and 3003809, then the duplicate message appears when copying the file fetched from the remote to the central. This means there are element_ids in the 1000000 range from both the central and remote sites. Here is what to do: 1) find out if there are any element_ids from both central and remote in the 1000000 range in the nh_elem_* tables. If so, they need to run a delete query to remove all of the 1000000-range element_ids. 
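The overlap Yulun describes, the same element_id range present on both the central and a remote, can be checked mechanically once the ids are exported one per line. A minimal sketch, assuming hypothetical export files (produced with something like echo "select element_id from nh_elem_assoc\g" | sql nethealth, one file per site); the file names are illustrative, not from the ticket.

```shell
# Print element_ids present in BOTH exported id lists. Any id printed
# here is a candidate cause of the E_CO003F duplicate-key warnings
# seen during the fetch merge.
overlap_ids() {
    sort "$1" > /tmp/ids_a.sorted
    sort "$2" > /tmp/ids_b.sorted
    comm -12 /tmp/ids_a.sorted /tmp/ids_b.sorted
}
```

Run once per remote against the central's export; an empty result means no shared ids, which is what Farrell later reports for the 1,000,000 range.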
Thanks Yulun 11/26/2001 9:52:04 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Monday, November 26, 2001 9:42 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Call ticket 53002 Importance: High Yulun, There are no elements on either of the remotes with element ids in the 1,000,000 range. What do we do next? See //BAFS/escalated tickets/53000/53002/Nov26. 11/27/2001 12:57:18 PM yzhang Farrell, Please have the customer do the following: echo " select name from nh_element where element_id < 2000000\g" | sql nethealth >name_file.txt nhDeleteElement name_file.txt. Please do a test before instructing the customer. Thanks Yulun 11/30/2001 7:58:38 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Friday, November 30, 2001 7:48 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 18205 E_CO003F COPY: Warning: 1296 rows not copied because duplicate key detected Importance: High Yulun, You have mentioned that you want the contents of the nh_elem_assoc table from the central and the remote. We already have the nh_elem_assoc.dat from the central in //BAFS/53000/53002/Nov6. Do we also need this from both remotes? Original fetch logs with problems: Job started by Scheduler at '13/8/2001 12:00:10 AM'. 
----- ----- $NH_HOME/bin/nhFetchDb ----- ### Beginning Fetch Mon Aug 13 00:00:13 EST 2001 ENTRY> 150.227.103.204 health /opt/concord/nethealth/db/remotePoller nethealth Connecting to host 150.227.103.204 Host 150.227.103.204 is alive FTP connection successful to host 150.227.103.204 Copying files from host 150.227.103.204 150.227.103.204::/opt/concord/nethealth/db/remotePoller/Remote.tdb.08-12-2001_23.00.58 tar: blocksize = 20 Done copying files from 150.227.103.204 Disconnecting from host 150.227.103.204 Disconnected from host 150.227.103.204 ENTRY> 160.65.98.2 poller2 D:\nethealth\db\remotePoller nethealth Connecting to host 160.65.98.2 Host 160.65.98.2 is alive FTP connection successful to host 160.65.98.2 Copying files from host 160.65.98.2 160.65.98.2::D:\nethealth\db\remotePoller\Remote.tdb.08-12-2001_13.00.52 tar: blocksize = 20 Done copying files from 160.65.98.2 Disconnecting from host 160.65.98.2 Disconnected from host 160.65.98.2 ### Beginning Merge Mon Aug 13 00:06:42 EST 2001 Deleting the following element ids from the central database: INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (0 rows) From 3000003 to 3003446. Removing element and analyzed data after 997534543 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (2528 rows) (2528 rows) (0 rows) (1477 rows) (2528 rows) Deleting the following element ids from the central database: From 2000243 to 2002632. Removing element and analyzed data after 997531188 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (913 rows) (913 rows) (0 rows) (432 rows) (847 rows) Checking for duplicate element names and inserting elements ... Adding remote element association, element alias and latency data ... Adding element data from files lat_b41.ascii/nea_b23.ascii/els_b45.ascii/mtf_b45.ascii ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. 
(0 rows) E_CO003F COPY: Warning: 4431 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 1477 rows successfully copied. (1477 rows) E_CO003F COPY: Warning: 7584 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 2528 rows successfully copied. (2528 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (82 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (184 rows) Adding element data from files lat_b41/nea_b23/els_b45/mtf_b45 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (0 rows) E_CO003F COPY: Warning: 1296 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 432 rows successfully copied. (432 rows) E_CO003F COPY: Warning: 2541 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 847 rows successfully copied. (847 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (82 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (100 rows) Logging elements deleted at the remote sites ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (0 rows) (0 rows) (0 rows) No cleanup of poller configuration file required. INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (0 rows) (0 rows) (0 rows) No cleanup of poller configuration file required. Updating servers with changes ... Creating database indexes on table nh_stats0_997534799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997538399 ... 
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997541999 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997545599 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997549199 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997552799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997556399 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997559999 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997563599 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997567199 ... 
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997570799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997574399 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997577999 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997581599 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997585199 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997588799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997592399 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997595999 ... 
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997599599 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997603199 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997606799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997610399 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997613999 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997617599 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997621199 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_997624799 ... 
INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Cleaning up merge files [: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error [: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error [: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error [: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error [: mergeDbCleanup 1394: mergeDatabase 1339: mergeDbs 1037: C:/nethealth/bin/nhFetchDb.sh 1882: expression syntax error Done merging database nethealth. ### Done Mon Aug 13 00:23:50 EST 2001 ----- Scheduled Job ended at '13/8/2001 12:23:50 AM'. 12/3/2001 9:42:08 AM yzhang on the folder Nov06, there is a set of nh_elem*.dat, and on the subfolder (nov06137) there is another set of nh_elem_.dat, which one is central, which one is remote? Thanks Yulun 12/3/2001 10:37:00 AM yzhang Can you collect the following from central and each of the remotes echo " copy table nh_elem_assoc() into 'nh_elem_assoc_central.dat'\g" | sql nethealth echo " copy table nh_elem_assoc() into 'nh_elem_assoc_remote1.dat '\g" | sql nethealth echo " copy table nh_elem_assoc() into 'nh_elem_assoc_remote2.dat'\g" | sql nethealth Thanks Yulun 12/7/2001 6:56:34 AM foconnor Received information //BAFS/escalated tickets/53000/53002/Dec07 12/11/2001 5:58:16 AM foconnor -----Original Message----- From: Ursula.Kors@za.didata.com [mailto:Ursula.Kors@za.didata.com] Sent: Tuesday, December 11, 2001 5:31 AM To: foconnor@concord.com Subject: RE: Call ticket 55292 Importance: High Hi Farrell Do you have an update for me on this? 
12/11/2001 5:49:35 PM yzhang Farrell, Looks like their database is still in a mess; the following is the output of the nh_elem_assoc table from their remote2. The same information is contained in the nh_elem_assoc table on the central machine. We have two options: 1) clean their database using nhDeleteElem as I requested before, combined with some queries; you can determine what should be deleted. 2) If you think the first option is hard, have the customer do a dbsave on each of the remotes and the central, obtain the dbs, and load them locally in house, so we can clean them in house and do more study on their db regarding why they obtained such strange ids. element_id (count): -706081024 (2), -639824128 (2), -623046912 (2), -606269696 (2), -589492480 (2), -522383616 (2), -505606400 (2), -136507648 (2), -35844352 (2), -19067136 (2), 1273171712 (2), 1289948928 (2), 1306726144 (2), 1323503360 (2), 1340280576 (2), 1357057792 (2), 1675104000 (2), 1708658432 (2), 1742212864 (2) 12/12/2001 2:19:58 PM yzhang Yes, you can do the cleaning based on what you mentioned. Prior to cleaning, check with them to find out why some strange element_ids, including negative values, were getting into the database; have they done anything unusual? The thing I am worried about is that we may remove elements they want to keep (e.g. the -233300003 element_id may be associated with a correct element that they want to keep). Check with the customer on this. Thanks Yulun 12/13/2001 2:07:51 PM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Thursday, December 13, 2001 1:57 PM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 18205 Yulun, The customer is not willing to send us their databases (corporate policy). Are there scripts we can create to remove the bad element ids? 12/13/2001 5:36:08 PM yzhang Grab the script from ~yzhang/scripts/getr_elem_related.sh, have the customer place it under NH_HOME, and check there is a $NH_HOME/tmp directory. 
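The id/count listing above can be regenerated from any exported id list with standard tools. A minimal sketch, assuming the element_ids sit one per line in a file (the file name is hypothetical):

```shell
# Report each element_id that appears more than once, together with its
# count -- the same shape as the listing in the ticket (id, count).
dup_report() {
    sort "$1" | uniq -c | awk '$1 > 1 { print $2, $1 }'
}
```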
Then run the script by just typing the script name. After it finishes there should be five dat files under the tmp directory. Tar up the dat files and put them on the FTP incoming site. Do this for each of the remotes and the central, with a different tar file name for each system. Also please check with the customer that the concord.out under nh_home has no errors before they post data to the ftp site. Please get this as soon as possible, I will be taking vacation from Wednesday next week. Thanks Yulun 1/2/2002 10:18:42 AM foconnor -----Original Message----- From: Shane Burdan [mailto:s_burdan@hotmail.com] Sent: Wednesday, January 02, 2002 12:00 AM To: support@concord.com Cc: dscott@ipperformance.com.au Subject: Ticket: 53002 Hi Farrell, In addition to the mail just sent, also upon investigation there appear to be two files that we think should get deleted at the end of the fetch but are not; they are smt_b23 and smt_b23.ascii, one off the UK NT and one off the HP-UX. They don't seem to cause any error messages in the Fetch, but we suspect that they were overlooked in the delete processing in the script. We would need to modify nhFetchDb to remove them, same as was done for the other files. Can you please advise whether these should be deleted or not, and if they should be, please advise so that either you or we can make the necessary changes to nhFetchDb.sh 1/2/2002 10:20:24 AM foconnor -----Original Message----- From: Shane Burdan [mailto:s_burdan@hotmail.com] Sent: Tuesday, January 01, 2002 11:39 PM To: support@concord.com Cc: dscott@ipperformance.com.au Subject: Ticket: 53002 Hi Farrell, Attached is the .dat file you requested from the last machine in ANZ's remote polling environment, the UK remote poller. In addition to your investigations of the three .dat files that you now have, we also require information regarding a modified nhFetchDb.sh script that ANZ are now using. 
This script (attached) was modified by Raj Jathar of Concord to explicitly remove the files that were not being removed during fetch time and thus causing duplicates to occur. So far, this has tested successfully; however, we need to know what will happen in regard to this modified script. Obviously, in the event of an upgrade, this script will be overwritten unless stored elsewhere and then replaced later; however, perhaps it would be even better to include the changes made in the regular production version of nhFetchDb. Please examine the changes made and let us know the impact of an upgrade and whether or not we can simply replace the newly upgraded nhFetchDb.sh with this one, or if you wish to let engineering and product management know of the changes so as to include them in the general release. This information is quite important as we expect ANZ to upgrade to version 5.0.2 shortly, and they will require the current nhFetchDb.sh as well, or the version 5.0.2 nhFetchDb.sh to incorporate the changes made. Please let me know your findings with: 1) The three .dat files 2) The modified nhFetchDb.sh and how we can set about upgrading with this in place. 1/3/2002 7:44:37 AM foconnor Files are on //BAFS/escalated tickets/53000/53002/20Dec01 1/22/2002 11:21:08 AM yzhang Farrell, Can you find out the following from the customer: 1) what is their status of upgrading to 5.0 2) what kind of changes they have made to nhFetchDb Thanks Yulun 1/28/2002 7:25:24 AM foconnor They are going to 5.0.2 in April; I have not received the nhFetchDb script. 1/28/2002 7:26:25 AM foconnor Yulun, The customer is not seeing the above error anymore but would like an investigation. Customer is getting the error: E_CO003F COPY: Warning: 4 rows not copied because duplicate key detected. in their fetch logs. This is a repeat of the problem resolved in Problem ticket 14040. Because of another database issue an older database had to be loaded and the problem is back. 
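Shane's suspicion above is that smt_b23/smt_b23.ascii are simply missed by the fetch script's delete pass. The actual change Raj Jathar made to nhFetchDb.sh is not shown in the ticket, so the following is only an illustrative sketch of what such a cleanup step could look like; the helper name is hypothetical, and the file list combines the pair the ticket says was overlooked with the merge files the fetch log shows.

```shell
# Remove merge temp files left in the fetch working directory after the
# merge completes. smt_b23 is the overlooked pair from the ticket; the
# other names are the files the log shows being merged.
cleanup_merge_files() {
    dir="$1"
    for f in smt_b23 lat_b41 nea_b23 els_b45 mtf_b45 tds_b30; do
        rm -f "$dir/$f" "$dir/$f.ascii"
    done
}
```

Keeping such a list in one place would also make the upgrade question easier: a single function to re-apply if a new nhFetchDb.sh overwrites the change.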
2/6/2002 12:02:54 PM yzhang Did you talk to the customer about whether we can close this one? I think you owe the explanation regarding the duplicates 2/15/2002 1:55:34 PM foconnor I need an explanation specific to fetches, not general duplicates. 2/19/2002 4:51:40 PM yzhang I checked with Jason, and Jason checked with Bob about the email I sent out regarding how to handle the duplicates for distributed polling; they both did not find it. I actually have a bug for this, will give you the explanation after I reproduce the problem. repeat of 20884 Yulun 10/2/2001 4:05:06 PM nalarid Customer is intermittently receiving the following error message upon console initialization: Server stopped unexpectedly, restarting The errlog.log is not showing any indication of the event, nor is the system messages log. The Maintenance log showed the following error: Starting Network Health servers. Error: Unable to send message to another process - you may need to restart the Network Health server (Broken pipe). The dbCollect.tar is showing nothing authored by this occurrence. This error has generated five times in the past six weeks, with no pattern in its occurrence. The customer reports no major changes to the machine before the error messages began appearing, however is experiencing a very slow startup for the Poller afterward. All relevant logs are located on BAFS, Ticket # 54637 10/3/2001 12:30:26 PM klandry Associating ticket #54596 10/11/2001 5:41:36 PM yzhang requested the latest output of nhReset 10/11/2001 6:21:59 PM yzhang Colin, Can you set NH_RESET_INGRES=1 on one of your nh48 systems on solaris, then run nhReset > nhReset.out, and send me nhReset.out. This customer has a problem with flushing the db cache. 
Thanks Yulun 10/12/2001 10:57:44 AM yzhang Tapio, are you still receiving the following error message upon console initialization: Server stopped unexpectedly, restarting Thanks Yulun 10/15/2001 4:34:59 PM yzhang the original error message has gone after properly running nhReset; ticket closed 10/4/2001 9:13:06 AM cestep Customer had errors in the errlog.log for about a month regarding the Ingres license. Now, when we stop the database and try to restart it, it fails saying that the database is not licensed. From the errlog.log: E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 00508b2c4ddc, PC_686_1_448, PAVO, 0 E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 00508b2c4ddc, PC_686_1_448, PAVO, 0 PAVO ::[II\IINMSVR , 00000000]: Tue Oct 02 09:26:36 2001 E_GC0151_GCN_STARTUP Name Server normal startup. PAVO ::[II\IINMSVR , 00000000]: Tue Oct 02 09:27:36 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. PAVO ::[II\IINMSVR , 00000000]: Tue Oct 02 09:28:46 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. PAVO ::[II\IINMSVR , 00000000]: Tue Oct 02 09:28:46 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 00508b2c4ddc, PC_686_1_448, PAVO, 0 However, he has the CA_LIC directory on the C drive. I obtained this directory and implemented it here, without reproducing the issue. The license files themselves appear to be fine. 
CA's support website indicates that there is a key generated on install that needs to match up with the license file. If this key became corrupted, would there be a way to find out? All files on BAFS, under ticket #54768. 10/4/2001 9:39:33 AM rtrei Ok Yulun-- Send this one to CA. See if they can help us recover these situations. 10/4/2001 3:08:38 PM yzhang Can you get the following for this ticket: 1) errlog.log 2) find out when they started running nethealth on the PC; do they have any similar issues, or other ingres issues, from before 3) run winmsd/file/savereport from start, and send the saved file yulun 10/5/2001 7:30:50 AM cestep -----Original Message----- From: Hans Goossens [mailto:HansG@simac.nl] Sent: Friday, October 05, 2001 3:25 AM To: 'support@concord.com' Cc: 'support@smtware.com'; 'mario.robers@smtware.com' Subject: FW: FWSMTI-00228: Ticket #54768 - License failure on Ingres Hi Collin, After your last email yesterday, I started looking at the ingres newsgroup on the Internet and found the solution to the problem. In the NT registry there should be a key which holds the location of the CA license, and this key was pointing to the wrong location.... After changing this key the ingres database came alive and the first small reports appeared on the screen. The reason the NT registry key was changed was, probably, because an Arcserve application was installed/activated (the registry key was pointing to C:\ARCSERVE\CA_LIC instead of C:\CA_LIC) Collin, Mario - Thanx for all your great help !!!! PS: the dbSave just completed :) Regards Hans --------------------------------------------------------------- Problem resolved. 10/9/2001 12:37:00 PM yzhang problem was solved by changing the registry key to point to the correct location 10/4/2001 1:17:10 PM rkeville After rebooting the system the NHTD instance does not start automatically after reboot, causing the nethealth server to fail at the start command. 
Error: - "Fatal Error: Assertion for 'db->isConnected ()' failed, exiting (in file ../SvrApp.C, line 629)." Process: - # /etc/init.d/nethealth.sh stop eHealth servers were not running. eHealth stopped successfully - # /etc/init.d/nethealth.sh start Starting eHealth servers. eHealth restarted successfully # Fatal Error: Assertion for 'db->isConnected ()' failed, exiting (in file ../SvrApp.C, line 629). Workaround: - Login as the oracle user and source $NH_HOME/nethealthrc.csh. - Run command: - $ORACLE_HOME/bin/svrmgrl - From the Server Manager prompt enter: - connect sys/change_on_install as sysdba - After you are connected enter: - Startup - The database should be started and mounted. - Login as the $NH_USER and source $NH_HOME/nethealthrc.csh. - Run command: - nhServer start - nhServer started Summary: - We need to start the instance automatically after reboot. #################################################################### 10/5/2001 9:43:23 AM wzingher repeat of 18096 which was just fixed. Try installing a new kit. 10/5/2001 2:32:45 PM wburke Stack dumps and cores all over the place. See PT# 15709 for the beginning of this problem. The DB team unlimited the stack size and that seemed to work for a couple of weeks. However Ingres is now crashing with DMF errors, stack dumps and exception errors. >>>CS_SCB found at 01BC0220<<< cs_next: 01BF5040 cs_prev: 01BB5040 cs_length: 15108. cs_type: FFFFABCD cs_self: 00000326 (806.) cs_stk_size: 00000000 (0.) 
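The manual workaround above (svrmgrl, connect as sysdba, Startup) is exactly what a boot-time script would need to run. A minimal sketch collecting those commands into a function; the credentials and paths are the ones quoted in the ticket, and wiring this into /etc/init.d (ordering it before nhServer start) is left out.

```shell
# Start the NHTD Oracle instance non-interactively by feeding Server
# Manager the same commands the workaround types by hand.
# Assumes ORACLE_HOME is set (e.g. by sourcing the nethealthrc file).
start_nhtd() {
    "$ORACLE_HOME/bin/svrmgrl" <<'EOF'
connect sys/change_on_install as sysdba
startup
EOF
}
```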
cs_state: CS_COMPUTABLE (00000001) cs_mask: (00000000) cs_mode: CS_INPUT(00000002) cs_nmode: CS_OUTPUT(00000003) cs_thread_type: CS_NORMAL(00000000) cs_username: administrator cs_sem_count: 00000000
-----------------------------------
Stack trace beginning at 7025fac3
Stack dmp name II\INGRES\1dd pid 477 session 326: 7025b1f7: (OIDMFNT,Base:701e0000)7025f480( 00e613d0 00000000 00000000 0000005b 001b3e43 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 70257535: (OIDMFNT,Base:701e0000)7025aa35( 016a2ee0 00000028 00000102 0000005b 00000000 001b3e43 0172ed00 440ad3ac 00000200 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 70256a77: (OIDMFNT,Base:701e0000)702571b2( 016a2ee0 00000028 00000000 00000200 0000005b 001b3e43 0172ed00 00000000 701d1024 0172ed0c 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 702bbb4a: (OIDMFNT,Base:701e0000)702566de( 0172ece0 00000028 00000200 0172ed0c 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 702b7bb9: (OIDMFNT,Base:701e0000)702bb9e6( 0172ece0 0163a5f8 440ad54c 440ad554 440ad588 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 702d754e: (OIDMFNT,Base:701e0000)702b7b39( 0172ece0 01640678 0163a5f8 00000001 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 7022ad79: (OIDMFNT,Base:701e0000)702d73b9( 0172ece0 01640678 00000100 0163a5f8 440ad600 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 701f78ac: (OIDMFNT,Base:701e0000)7022aaf0( 01640648 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 7078756a: (OIDMFNT,Base:701e0000)701f7740( 0000001f 01640648 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 7078bce8: ????????( 0154dfa0 01bc9380 00000001 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 7078bce8: ????????( 0154df30 01bc9380 00000001 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 707a8689: ????????( 0154e700 01bc9380 00000001 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 707af649: (OIQEFNT,Base:70760000)707a83a0( 0154e644 01bc9380 00000001 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 7077271f: (OIQEFNT,Base:70760000)707ae782( 01bc9380 00000010 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 708f314f: (OIQEFNT,Base:70760000)70771460( 00000010 01bc9380 )
Stack dmp name II\INGRES\1dd pid 477 session 326: 70146ed9: ????????( 00000002 01bc0220 01bc0260 )
Stack dmp name II\INGRES\1dd pid 477 session 326: lstrcmpiW: ????????( )
00000326 General Protection Exception @7025fac3 SP:440aceb0 BP:440ad084 AX:0 CX:39230596 DX:59f526c BX:1bc0220 SI:1774d00 DI:1bc9f3c
00000326 Thu Oct 04 01:00:44 2001 E_DM9049_UNKNOWN_EXCEPTION An Unexpected Exception occurred in the DMF Facility, exception number 68197.
KINGKONG::[II\INGRES\1dd , 00000326]: An error occurred in the following session:
KINGKONG::[II\INGRES\1dd , 00000326]: >>>>>Session 00000326<<<<<
KINGKONG::[II\INGRES\1dd , 00000326]: DB Name: nethealth (Owned by: administrator )
KINGKONG::[II\INGRES\1dd , 00000326]: User: administrator (administrator )
KINGKONG::[II\INGRES\1dd , 00000326]: User Name at Session Startup: administrator
KINGKONG::[II\INGRES\1dd , 00000326]: Terminal: console
KINGKONG::[II\INGRES\1dd , 00000326]: Group Id:
KINGKONG::[II\INGRES\1dd , 00000326]: Role Id:
KINGKONG::[II\INGRES\1dd , 00000326]: Application Code: 00000000 Current Facility: QEF (00000006)
KINGKONG::[II\INGRES\1dd , 00000326]: Client user: administrator
KINGKONG::[II\INGRES\1dd , 00000326]: Client host: KINGKONG1
KINGKONG::[II\INGRES\1dd , 00000326]: Client tty: KINGKONG1
KINGKONG::[II\INGRES\1dd , 00000326]: Client pid: 798
KINGKONG::[II\INGRES\1dd , 00000326]: Client connection target: nethealth
KINGKONG::[II\INGRES\1dd , 00000326]: Client information: user='administrator',host='KINGKONG1',tty='KINGKONG1', pid=798,conn='nethealth'
KINGKONG::[II\INGRES\1dd , 00000326]: Description:
KINGKONG::[II\INGRES\1dd , 00000326]: Query: select count(*) from iicolumns where table_name='nh_element' and column_name='ip_address'
Check the server error log.
00000326 Thu Oct 04 01:00:46 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
00000326 Thu Oct 04 01:00:46 2001 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:47 2001 E_SC0122_DB_CLOSE Error closing database. Name: nethealth Owner: administrator
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:47 2001 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: E:\nethealth\oping\ingres\data\default\nethealth Flags: 00000003
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:47 2001 E_SC0221_SERVER_ERROR_MAX Error count for server has been exceeded.
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:48 2001 E_PS0501_SESSION_OPEN There were open sessions when trying to shut down the parser facility.
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:49 2001 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:49 2001 E_SC0235_AVERAGE_ROWS On 13397. select/retrieve statements, the average row count returned was 3.
KINGKONG::[II\INGRES\1dd , 00000205]: Thu Oct 04 01:00:49 2001 E_SC0127_SERVER_TERMINATE Error terminating Server.
KINGKONG::[ , 00000000]: Thu Oct 04 05:03:24 2001 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation.
10/5/2001 2:33:46 PM wburke BAFS/54838
10/5/2001 2:50:12 PM rtrei Yulun-- You did the right things in the previous ticket. We need to find out with CA what is going on here. Treat this one as escalated.
10/5/2001 5:18:49 PM yzhang Created the following issue with CA: I created an issue (11278460) with you about a month ago regarding a stack dump and subsequent Ingres crash; at that time, you recommended doubling the stack size of 131072 and turning the group buffer off.
Our client did what you recommended, and the system was up and running, but the stack dump has occurred again, with a SEGV core dump, after about one month on the same system of the same customer. Here is some error output from errlog.log. This time, I would like you to consider a more general solution, and give a clearer picture of the relationship between the stack dump, database inconsistency, deadlocks, and so on. I think this is a serious problem we need to deal with; our customer is down. Let me know if you need any information. Thanks Yulun
10/9/2001 9:27:16 AM yzhang Can you have the customer send patch.txt under \oping20\ingres, and oidmfnt.dll under oping20\ingres\bin. Note that patch.txt is an ASCII file and oidmfnt.dll is a binary file; have the customer send those files as soon as possible. Thanks Yulun
10/9/2001 9:39:49 AM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 09, 2001 9:23 AM To: 'kwok.lee@chase.com' Cc: Nadasi, John Subject: FW: 18326/54838 Hi Kwok, Engineering has requested the following: patch.txt under \oping20\ingres and oidmfnt.dll under oping20\ingres\bin.
10/9/2001 10:52:16 AM wburke -----Original Message----- From: KWOK.LEE@chase.com [mailto:KWOK.LEE@chase.com] Sent: Tuesday, October 09, 2001 9:56 AM To: support@concord.com Cc: wburke@concord.com Subject: Re: FW: 18326/54838 Walter, Attached are the files requested.
10/11/2001 10:36:24 AM yzhang Called CA again; the technician assigned to this issue is out, and CA will have another technician look at this problem.
10/11/2001 1:22:22 PM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, October 11, 2001 1:05 PM To: 'kwok.lee@chase.com' Cc: Nadasi, John Subject: Ticket # 54838 Kwok, Currently we are working with CA Ingres to determine the cause of the DB failures. I will contact you as soon as I have more information from CA.
-Walter
10/11/2001 4:36:52 PM yzhang For the first two, the customer can do the following and send iivdb.log: 1) verifydb -mreport -sdbname iidbdb -odbms_catalogs 2) verifydb -mreport -sdbname nethealth -otable nh_element. Can you tell me the detailed procedure for getting drwtsn32.log, user.dmp, and memory.dmp? Hope I can hear from you today. Thanks Yulun
10/12/2001 9:59:07 AM yzhang After talking to Computer Associates, they requested the following information. 1. verifydb in report mode against the system catalogs: verifydb -mreport -sdbname iidbdb -odbms_catalogs 2. verifydb in report mode against table nh_element: verifydb -mreport -sdbname nethealth -otable nh_element, then collect the iivdb.log 3. I would also need the following files: 1) the procedures or query of nhquery4 and nh_stats0_998866799qc 2) the Dr. Watson log, named drwtsn32.log 3) the user.dmp 4) the memory.dmp. Here is a way to collect item 3: the Dr. Watson log is under the WINNT\SYSTEM32 directory, named drwtsn32.log. As for user.dmp and memory.dmp, their location will depend on the system configuration of your client's box; usually they are available under the winnt folder, but this may differ depending on the environment variables set. So the easiest way is to use the 'Find Files or Folders' function from Start -> Search and look for *.dmp files. Thanks Yulun
10/15/2001 5:43:08 PM yzhang Here is the executable for checksum.
You can run it by executing 'plgen' %II_SYSTEM%\ingres\bin\iidbms.exe, or rename it to checksum or sum and run 'sum' %II_SYSTEM%\ingres\bin\iidbms.exe Regards, Anton
10/15/2001 6:26:20 PM yzhang Requested the following from the customer: 1) errlog.log starting Oct 1, 2001 up to today 2) config.dat 3) config.log 4) protect.dat ---- all files are under %II_SYSTEM%\ingres\files 5) the output of the 'set' command, i.e. set > set.out 6) the output of the 'ingprenv' command, i.e. ingprenv > ingprenv.out 7) what NT service pack your customer has installed on his PC ---- at the MS-DOS prompt type 'winver' or 'winmsd' 8) dir %II_SYSTEM%\ingres\bin > bin.out 9) dir %II_SYSTEM%\ingres\utility > utility.out 10) type %II_SYSTEM%\ingres\version.rel > versionrel.out 11) type %II_SYSTEM%\ingres\version.dat > versiondat.out 12) verifydb -mreport -sdbname "" -odbms_catalogs ---- will write the output to %II_SYSTEM%\ingres\files\iivdb.log 13) verifydb -mreport -sdbname "" -otable nh_element 14) sum or checksum for iidbms.exe ---- for this one I
10/16/2001 9:30:49 AM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 16, 2001 9:13 AM To: 'kwok.lee@chase.com' Cc: Nadasi, John Subject: Ticket # 54838 - Ingres Failure Hi Kwok, Engineering and CA Ingres have requested the following: 1) errlog.log starting Oct 1, 2001 up to today 2) config.dat 3) config.log 4) protect.dat ---- all files are under %II_SYSTEM%\ingres\files 5) the output of the 'set' command, i.e. set > set.out 6) the output of the 'ingprenv' command, i.e. ingprenv > ingprenv.out 7) what NT service pack is installed on the $NH_SERVER ---- at the MS-DOS prompt type 'winver' or 'winmsd' 8) dir %II_SYSTEM%\ingres\bin > bin.out 9) dir %II_SYSTEM%\ingres\utility > utility.out 10) type %II_SYSTEM%\ingres\version.rel > versionrel.out 11) type %II_SYSTEM%\ingres\version.dat > versiondat.out 12) verifydb -mreport -sdbname "" -odbms_catalogs ---- will write the output to %II_SYSTEM%\ingres\files\iivdb.log 13) verifydb
-mreport -sdbname "" -otable nh_element 14) sum or checksum for iidbms.exe ---- for this one I Sincerely,
10/16/2001 11:15:01 AM yzhang Walter, This is the information requested by CA. I know you have some of it already; can you get the rest as soon as possible? Thanks Yulun
10/16/2001 2:43:43 PM wburke -----Original Message----- From: KWOK.LEE@chase.com [mailto:KWOK.LEE@chase.com] Sent: Tuesday, October 16, 2001 1:59 PM To: support@concord.com Subject: Re: FW: checksum for Step # 14 in 54838 Walter, Below is the checksum. ----------------------------------------------------------------------------------
E:\kl\temp>sum %II_SYSTEM%\ingres\bin\iidbms.exe
E:\nethealth\oping\ingres\bin\iidbms.exe: size = 17920, checksum = 4294966560
E:\kl\temp>
10/16/2001 4:16:14 PM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 16, 2001 3:59 PM To: Zhang, Yulun Subject: FW: 55209 - New Customer CA Ingres issue: Importance: High Yulun, note the customer could not find the following: config.log/protect.dat/version.rel/version.dat. - I searched with him to no avail. -Walter
10/16/2001 4:32:52 PM yzhang Anton, we have another customer with the exact same stack dump; here is the information we collected. Be aware this is another customer, so don't confuse it with the information I sent previously. Note this customer could not find the following: config.log/protect.dat/version.rel/version.dat. - I searched with him to no avail. Please let me know of any solution you have as soon as possible. Thanks
10/17/2001 10:54:21 AM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, October 17, 2001 10:37 AM To: 'kwok.lee@chase.com' Cc: Nadasi, John Subject: Ticket # 54838 Hi Kwok, I am awaiting the results of steps 1-13. If you have any questions please call.
Sincerely,
10/17/2001 1:27:14 PM yzhang Can you have the two customers run the following query: select count(*) from iicolumns where table_name='nh_element' and column_name = 'ip_address' and send me the output, and watch to see whether running the query crashes the DBMS server. Thanks Yulun
10/17/2001 1:29:31 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, October 17, 2001 1:11 PM To: 'hjohnson@network-guidance.com' Subject: Ticket # 55209 Howard, Please run the attached. Send the output file count.out from the current working directory. << File: count.sh >> Thanks,
10/17/2001 1:30:48 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, October 17, 2001 1:13 PM To: 'kwok.lee@chase.com' Cc: Nadasi, John Subject: Ticket # 54838 - Ingres Stack Dump Kwok, Please run the attached. Send the output file count.out from the current working directory. << File: count.sh >> Thanks,
10/17/2001 2:25:37 PM wburke 2nd Customer: INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Wed Oct 17 13:07:29 2001 continue * Executing . . .
+-------------+
|col1         |
+-------------+
|            1|
+-------------+
(1 row)
continue * Your SQL statement(s) have been committed. Ingres Version II 2.0/9808 (int.wnt/00) logout Wed Oct 17 13:07:29 2001 _____________________________________
10/17/2001 3:02:02 PM wburke Obtained all info for CA from Chase. BAFS/54838/10-16
10/18/2001 11:59:20 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, October 18, 2001 11:42 AM To: 'hjohnson@network-guidance.com' Subject: 55209 - STACK DUMP AGAIN Howard, We require the following on this ticket from the new customer, who has the stack dump problem: 1) get OIDMFNT.DLL from ingres/bin (just a copy). 2) ask if they made any change to config.dat without going through cbf. Is there a backup copy of config.dat in $NH_HOME/oping/ingres/files, i.e. config.dat.bak?
What is the modified date on the config.dat? 3) get the output of sql nethealth (which shows the Ingres version), i.e. d:\sql nethealth > nh.out Thanks, Walter
10/19/2001 5:29:07 PM wburke -----Original Message----- From: Jeremy Klomp [mailto:JKlomp@network-guidance.com] Sent: Friday, October 19, 2001 5:18 PM To: 'Burke, Walter' Cc: Howard Johnson Subject: RE: 55209 - STACK DUMP AGAIN Walter, Here is the info you requested: 1) Included 2) The modified and created dates are: Created: September 20th 2:04 AM (this was the morning we had to rebuild from the virus) Modified: September 20th 2:10 AM (I am sure this is from the file being written to during the installation) 3) Included BAFS/55209/10-18-01
10/22/2001 9:43:42 AM yzhang Anton, Here are the libraries you requested. Also, when I have the customer do sql dbname, the output is the following: INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Fri Oct 19 14:28:46 2001 continue * Ingres Version II 2.0/9808 (int.wnt/00) logout Fri Oct 19 16:08:53 2001 That means the actual running version of Ingres is II 2.0/9808. -----Original Message----- From: Team PCS [mailto:PCS@ca.com] Sent: Wednesday, October 17, 2001 6:21 PM To: yzhang@concord.com Subject: CA Startrak Issue 11370508;01 - STACK DUMP AGAIN. All this information comes from the latest customer who has the stack dump problem. Also, I got your phone message; let me know the solution. Thanks
10/22/2001 3:36:05 PM wburke 55209 NOTE: -----Original Message----- From: Howard Johnson [mailto:hjohnson@network-guidance.com] Sent: Monday, October 22, 2001 3:08 PM To: 'Burke, Walter' Subject: RE: 54461 , 55209 Importance: High Walter, 1) Can I do numbers 6, 7 and 9 without running the nhServer stop command beforehand? I really do not want to stop the machine from polling. 2) We are using McAfee antivirus protection.
We have on-access scanning running, and it was set up to scan all files with nothing excluded. We have found out now that this does lock files while scanning. I have now changed the setting to exclude scanning D:\nethealth completely (this is the home directory of nethealth). I have also done this for the system scan that runs every Sunday at 5pm. 3) We are also running the ARCserve open file agent that I mentioned when this all first started. This is set to exclude backing up d:\nethealth\db\data. Should we be excluding anything else? The actual backup software is on a different server; just the agent is on the Concord system. Please get back to me. _________________________________________________________________
10/22/2001 6:04:40 PM yzhang What CA recommended is to install the latest Ingres patch (p6830). The customer is on 4.8 P4 on NT with II 2.0/9808. I believe their patch is 6772, even though they don't have version.rel or version.dat files under the ingres directory. I don't think CA made a correct recommendation. I want to know what you think before I reply to CA. Yulun
10/23/2001 10:11:59 AM wburke -----Original Message----- From: Trei, Robin Sent: Monday, October 22, 2001 7:31 PM To: Burke, Walter; Zhang, Yulun Cc: Gray, Don Subject: RE: 18326 - Ingres Stack Dump Walter-- good investigation! They are all stack dumping on the same query. If we rewrite the query, we should be able to give these customers a test exe very quickly. Unfortunately, this is in cdbLib, which is called everywhere. Do we happen to know what is running when this error occurs? Yulun-- This is in wsCore\cdbLib\cdbDataList.sc.
The function call is no longer needed, so we can just have it return 0. Update:

long isOldSchemaVersion ()
{
    EXEC SQL BEGIN DECLARE SECTION;
    int rowcnt = 0;
    EXEC SQL END DECLARE SECTION;

    EXEC SQL WHENEVER SQLERROR GOTO err_rtn;
    EXEC SQL SELECT COUNT(*) INTO :rowcnt
        FROM iicolumns
        WHERE table_name = 'nh_element' AND column_name = 'ip_address';
    if (rowcnt == 0)
        return Yes;
    return No;

err_rtn:
    EXEC SQL WHENEVER SQLERROR CONTINUE;
    EXEC SQL ROLLBACK WORK;
    return No;
}

to

long isOldSchemaVersion ()
{
    return No;
}

10/23/2001 11:32:09 AM wburke Obtained all requested info. BAFS/54838/10-16 BAFS/55209/10-16 and 10-18 BAFS/55410/10-22
10/24/2001 12:50:43 PM mfintonis Robin is working on getting an exe today. Changed to WIP per this morning's escalation meeting.
10/24/2001 3:15:38 PM wburke Informed all customers of WIP.
10/25/2001 9:50:36 AM yzhang Walter, I learned that only one of the stack dump customers doubled the stack size to 262144. The other two are still at the default of 131072. You can have those two set the following parameter to 262144 through cbf: ii.daytona.dbms.*.stack_size: 131072 Yulun
10/25/2001 4:17:26 PM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, October 25, 2001 4:03 PM To: 'kwok.lee@chase.com' Cc: Nadasi, John Subject: Ticket # 54838 - Ingres Stack Kwok, We have provided 2 executables which should resolve the Ingres stack dump issue.
ftp ftp.concord.com
login: anonymous
pass: ident
cd outgoing
bin
mget 18326*
get 18326nhReport.exe
get 18326nhiStandardReport.exe
Rename $NH_HOME/bin/nhReport.exe to *.bak and copy 18326nhReport into $NH_HOME/bin/nhReport. nhiStandardReport goes into $NH_HOME/bin/sys; make a backup of the original. Stop and start the server. Monitor.
Sincerely,
10/26/2001 10:04:13 AM wburke -----Original Message----- From: Martinez, Ruben [mailto:RMartinez@marathonoil.com] Sent: Friday, October 26, 2001 9:39 AM To: support@concord.com Subject: RE: Ticket # 55410 - Ingres Stack Walter, I have applied these files as of 9:30am today and will monitor the system for what I hope to be no problems. Thanks for your help, Ruben
10/30/2001 11:26:16 AM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 30, 2001 11:18 AM To: 'kwok.lee@chase.com' Cc: Ramsey, Susan Subject: Ticket # 54838 - dbCrash to nhiReport failure Hi Kwok, Since loading the one-off executables, have we seen the issue arise again? Sincerely,
10/30/2001 11:58:26 AM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, October 30, 2001 11:50 AM To: 'hjohnson@network-guidance.com' Subject: Ticket # 55209 Howard, Has Ingres stopped or stack dumped while running any reports since the install of the new executables? Sincerely,
10/30/2001 12:02:55 PM wburke -----Original Message----- From: Howard Johnson [mailto:hjohnson@network-guidance.com] Sent: Tuesday, October 30, 2001 11:58 AM To: 'Burke, Walter' Subject: RE: Ticket # 55209 No, not yet. We will be running a heavy dose of reports this Thursday morning due to the fact that it is the end of the month. That will be a good test for this. _____________________________________________________________ Reports have not failed, nor has Ingres crashed, as of yet...
10/31/2001 12:07:11 PM wburke Kwok will run month-end reports on Friday.
11/1/2001 10:01:59 AM wburke One customer reports no failures. -----Original Message----- From: Martinez, Ruben [mailto:RMartinez@marathonoil.com] Sent: Thursday, November 01, 2001 8:53 AM To: support@concord.com Subject: RE: Ticket # 55410 Walter, In response to your phone inquiry about this ticket, so far we have not encountered any further problems. My guess is that you can close this ticket for now if you wish, and I will continue to monitor it.
Thanks, Ruben Martinez ____________________________________ Waiting on Chase. They are the big one.
11/2/2001 1:18:36 PM wburke -----Original Message----- From: Howard Johnson [mailto:hjohnson@network-guidance.com] Sent: Friday, November 02, 2001 1:10 PM To: 'support@concord.com' Subject: RE: Ticket # 55461/55209 Importance: High Walter, I just had a Dr. Watson on the nethealth box that caused nethealth to go all the way down. The Dr. Watson is on nhiStdreport.ex.exe. This is not a misprint; that is actually what it says for the process. I have included the log with some other files for you to examine. During the time of this Dr. Watson, I was running some process reports out of the web interface for Advantage View version 1.2 patch 1. This caused the Pearl.exe process to peg the CPU and hold it there until we killed it. I called tech support to open a ticket on this, and they said that neither nethealth nor Advantage View uses Pearl.exe. Obviously it does, because I ran a report on Advantage View again and Pearl.exe showed up again. It ran fine that time. I am worried about two things: first, why is it saying that nhistdreport.exe is nhiStdreport.ex.exe? In the folder, it only says that it is .exe. Also, what happened with the Pearl.exe process? I was running and printing several reports from Advantage View at the time that this happened; could something have hung? Are there any known issues on this? If any of this is an additional ticket, please forgive me. Please get back to me when you have a chance. Thank you
11/2/2001 2:36:31 PM wburke From: Howard Johnson [mailto:hjohnson@network-guidance.com] Sent: Friday, November 02, 2001 1:51 PM To: 'Burke, Walter' Subject: RE: Ticket # 55461/55209 Walter, please note that at the time of these failures, our PDC, which has the print spooler, had run out of space because our printer had stopped. Once I started the printer back up, the reports seemed to print fine.
I do not know whether that would cause the Dr. Watsons that we saw on that day or not.
11/5/2001 10:27:30 AM yzhang Waiting for the customer's reply.
11/5/2001 3:52:14 PM yzhang I reviewed the information for a possible core dump due to Ingres or reporting. There is no clear clue whether the core dump is coming from Ingres or reporting, but we need to find out what caused it. Please check with the customer on the following: 1) Their Ingres was not running from 11/2/01 10:00 PM to 11/5/01; what happened? The only error message in errlog.log is E_GC0001_ASSOC_FAIL, which I believe is due to an unclean shutdown of Ingres. 2) When was the core file created, and in what directory did the customer find it? 3) Why is there no maintenance log for 4/11/01? 4) Check what kind of Ingres processes are running. Thanks Yulun
11/6/2001 3:00:23 PM wburke Obtained BAFS/54838/11-6-01 -----Original Message----- From: Susan Ramsey [mailto:sramsey@concord.com] Sent: Tuesday, November 06, 2001 2:21 PM To: 'Burke, Walter'; support@concord.com Subject: RE: Ticket # 54838 - dbCrash to nhiReport failure Importance: High Have we heard back from Kwok? This must be taken care of (according to Kwok's management) by end of day on Thursday because of DAY 2 merger activities (Chase and JP Morgan). S
11/6/2001 4:07:54 PM yzhang Walter, you also need to get poller.status.log from the log directory, in addition to the nhiPoller advanced log. Don, this one turns out to be a poller problem; I suggest reassigning to Dave Shepard. Thanks Yulun
11/6/2001 4:08:16 PM yzhang Collecting information from the customer.
11/7/2001 10:28:11 AM wburke CPU - the server has 4 CPUs, each running at 400 MHz.
RAM - it has 1 GByte of physical memory
SWAP:
Physical Memory (K) Total: 1,047,988 Available: 782,708 File Cache: 34,708
Pagefile Space (K) Total: 1,810,432 Total in use: 231,924 Peak: 276,324
C:\pagefile.sys Total: 786,432 Total in use: 145,312 Peak: 172,808
E:\pagefile.sys Total: 1,024,000 Total in use: 86,612 Peak: 103,516
Thanks, Kwok
11/7/2001 10:44:19 AM yzhang Fixed by increasing the stack size and building the new report executable; the fix will go into 4.8 patch 8.
11/7/2001 11:38:53 AM mmcnally Same error received, poller crashing; call ticket 55744. Failed to establish RPC connection to NuTCRACKER Service (error=1702). [nhintutil.exe (xftconsole.cpp:919) PID=560 TID=303]
11/7/2001 11:53:24 AM mfintonis If this is waiting to be put into a patch, it needs to remain in Field Test until it has been completely and successfully merged into a patch.
11/9/2001 11:01:46 AM wburke -----Original Message----- From: Burke, Walter Sent: Friday, November 09, 2001 10:53 AM To: Zhang, Yulun Cc: Trei, Robin Subject: Ticket # 54838 Yulun, This ticket can be closed. Both customers, since installing the new exe's, have not had Ingres stack dumps on nhiReport. Note I am opening a new ticket for the Dr. Watson errors, which are a different problem. -Walter
11/9/2001 11:06:11 AM yzhang The new report executable seems to be working.
10/12/2001 11:03:24 AM wburke I have an idea on how to extricate ourselves from the current situation, which I want to run past you guys for verification before proceeding (it's much safer than what I suggested before). We still have a valid nhSaveDb from Friday Oct 6 (before the ill-fated fetch), which would contain up to and including Thursday night's fetch. We also have the as-polled data on the pollers going back to Wednesday Oct 4 (late PM).
Wouldn't it make sense to do a save NOW (known state), then restore Friday night's saved DB, then move aside the remote save directory on each of the pollers, perform a remote save on each poller (to a clean directory), and then re-fetch all the raw data? We have extended the as-polled rollup to 14 days on each of the pollers so as not to lose any data, but it can't go beyond this due to disk limits. I have estimated the entire process should take about 12 hrs. Do you agree with this plan? Can you see any significant risks associated with it? ** We require your email response by close of business Boston time Friday 12th October. Attached is the latest dialog.txt, which records all the actions taken to date. <> (NOTE this supersedes previous versions of the file)
10/12/2001 11:20:43 AM yzhang Walter, here are the steps: 1) write a script to unload all of the stats0 tables from the source machine 2) ftp the unloaded files (binary files) 3) write a script to load the files ftped from the source machine 4) update the nh_rlp_boundary table. I think Bob can help you with the scripting if he is available.
10/12/2001 5:58:25 PM wburke -----Original Message----- From: Burke, Walter Sent: Friday, October 12, 2001 5:32 PM To: 'Graham.Hughes@team.telstra.com'; 'Simon.Marko@team.telstra.com' Cc: Glasheen, Rick; Beale, Greg; Kalda, Andrew; 'support@ipperformance.com.au' Subject: Ticket # 54938 - Db Merge. Please disregard previous incomplete version. Graham, Simon, The following is the plan. It is a straightforward, uncomplicated method of replacing the missing data. Be sure to follow all normal precautions. Please note, these scripts were based on the data supplied by Telstra regarding the stats0 tables desired. NOTE: as with all scripts, ensure there are no ^M characters and that execute permission is set.
1. Save the database in a safe location prior to the adjustment.
2. Run tableOut.sh on backupdb. This will create the $NH_HOME/tmp/tables directory and write output there.
(This extracts binary data (.dat) files from the stats0 tables.)
3. Run rlpOut.sh on the backup. This will output nh_rlp_boundary.dat in the $NH_HOME/tmp/tables directory. (This extracts the rollup boundary table from the backup machine, which is needed in order to roll up the data correctly.)
4. tar the tables directory on the backup machine and FTP it in BINARY mode to the $NH_HOME/tmp directory of the current machine. Untar the file.
5. Place both tableIn.sh and rlpIn.sh into the $NH_HOME/tmp directory.
6. nhServer stop.
7. Run tableIn.sh on the current dB. This will create and import the raw data into new stats0 tables with corresponding timestamps.
8. Run rlpIn.sh on the current dB. This will append to the current nh_rlp_boundary table the necessary information regarding the old data.
9. nhServer start.
10. Allow normal functioning, i.e. rollups to occur as scheduled.
11. After rollups occur, you should be able to view reports for the missing time period.
Sincerely,
10/16/2001 11:46:40 AM wburke -----Original Message----- From: Marko, Simon [mailto:Simon.Marko@team.telstra.com] Sent: Sunday, October 14, 2001 7:07 PM To: # IS - Reporting Team Cc: # IS - Systems Team; Hughes, Graham; 'support@ipperformance.com.au'; 'support@concord.com' Subject: NHCN1 Status (IPM Ticket 162, Concord USA Ticket 54938) Folks, You should notice that the data which was 'missing' for September 27/28/29 for many customers has now returned, due to some major database surgery. However, since fetches have been disabled, the latest data on the console is around Sunday Oct 8. If you notice some data is still missing (only for the days specified above), can you please let Graham Hughes or Andrew Petrie know ASAP. I will be working with Concord USA again tonight to make sure the fetches run smoothly; until then, please confine your report timescales to end before Oct 8, or wait until tomorrow.
Thanks for your patience.
10/16/2001 6:35:13 PM wburke -----Original Message----- From: Marko, Simon [mailto:Simon.Marko@team.telstra.com] Sent: Monday, October 15, 2001 1:41 AM To: 'Burke, Walter'; 'rickg@concord.com' Subject: RE: NHCN1 Status (IPM Ticket 162, Concord USA Ticket 54938) Hi Walter and Rick, Just FYI, I have appended what has happened since Saturday AWST instead of sending the entire lot. We're now up to the stage of fetching, but I want to get your opinion on what we can do to protect ourselves against a failed fetch. I am planning to try calling you Monday AM Atlantic time (7 and/or 8). I appreciate your team working out how to do what I originally suggested - it's pretty complex, but it seemed to work. As you can tell from the log, we don't have a 'backup machine', so I had to unzip the tables out of the savedb directory on the console disk and work out which of those *_b30.zip files was the nh_rlp_boundary table (with a lot of help from a hex editor). Add the odd segfault and cryptic Ingres logs, and what have you got? A weekend of fun and excitement!! Hope you both had a relaxing weekend - speak to you soon. Regards Simon Marko Applications & Systems Team Telstra Internetwork Management Services
11/20/2001 11:52:32 AM yzhang Problem solved.
10/12/2001 2:40:37 PM shagar Hans Maurer (Reseller - Amasol) has spoken to Brad Carey regarding this and was told to open a ticket. Problem: nhiCfgServer crashes with a certain sequence of DCI/manual operations. This has been reproduced on 4.8 with no patches, P4 and P6. Steps:
1- Start with an empty DB
2- Merge 2 distinct DCI files containing elements into the nh db
3- Delete all newly merged elements
4- Re-import 1 DCI file
Console messages generated: - user modified poller config - stats poller updated with new config - server stopped unexpectedly - restarting server - successful. Customer sent in a script and two .dci files to reproduce this with.
Script and files on BAFS\55000\55038 10/15/2001 12:16:43 PM rtrei Dave-- Assigning to you since it is involved with DCI. If it is DB, reassign back to me. 10/24/2001 3:49:09 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Wednesday, October 24, 2001 03:41 PM To: Shepard, Dave Subject: Problem Ticket 18455 Hi Dave, Do you have any status on the subject problem ticket, "nhiCfgServer crashes"? Thanks - Mark 10/24/2001 3:55:09 PM mwickham -----Original Message----- From: Shepard, Dave Sent: Wednesday, October 24, 2001 03:45 PM To: Wickham, Mark Subject: RE: Problem Ticket 18455 I am in class all week and not working on anything. I will then be catching up on escalated issues and 5.5 criticals when I get back. Figure at least a week. Cheers, --Dave 11/14/2001 10:17:15 PM drecchion -----Original Message----- From: Recchion, David Sent: Wednesday, November 14, 2001 10:08 PM To: Shepard, Dave Subject: Problem Ticket 18455 Hi Dave, This is a follow-up to see if you've had a moment to get to this? Thanks, Dave R 11/16/2001 8:08:22 AM tbailey Don Gray recommended we get the config server debug when it crashes 1/9/2002 8:31:41 PM bedelson I ran the given script successfully (no crash) on 5.01. This suggests a problem outside of DCI (the only major area that would impact this problem that differs from 4.8 to 5.0 is groups-in-DB). The following information would aid evaluation: 1) Core dump as mentioned above. 2) system log 3) advanced log for cfgServer, dbServer and msgServer. In the meantime I will run their test on a 4.8 machine. 2/11/2002 3:03:27 PM dbrooks closed. will reopen if customer requests. 10/15/2001 10:16:59 AM foconnor Rollups take up to 3 days, even if duplicate elements are deleted. - 6 hours MAY be acceptable. 3 days is not. - Glance reports permitted from drilldowns. Not from element lists. - They cannot run reports from the web when fetches are taking place - Error: 'bkws-ch-brn-s-01-M2-1-port' was not found in the lookup. 
Even if you can create TopN reports with these elements. Apparently, time to complete FetchDb increases over time. - Element lists not available on reporting interface while iimerge occurs. 4 min acceptable, 15 or more minutes not acceptable. - CT53419 - Memory consumption increases over time for the Cfg server, measured with sysedge. If Fetch fails, there is a gap in the central DB. How to re-fetch? Why does the DB fail? Why failovers? CT54113 - Transaction log was 300Mb. Garath increased to 2Gb. Improves a little. Rollups failed on 10/13 (after a huge amount of elements deleted). - Database lockups detected. Jose increased locks permitted. Rollups started. - Rollups take a long time anyway. 4 hours so far, and counting. According to ingres error.log, the shared memory limit suggested by our installation guide was not enough in at least one situation. (see log below) Files: //BAFS/escalated tickets/55000/55200 10/15/2001 10:19:57 AM foconnor Work/Modifications performed on system: Fetch frequency changed to 1h from 1/2 hr. 2 hrs would be better. The longer the better. Duplicate elements removed from DB: sql nethealth delete from nh_elements where element_id<2000000\g\q cp $NH_HOME/poller/poller.init $NH_HOME/poller/poller.cfg do a fetch again. NOTE: it is possible that we may need to drop the nh_elem_alias also. If there is a clean workaround like this one that does not go through the nh_deleted_element table, they would like to know it. $NH_HOME/idb/ingres/files/config.dat ii.sbe7441.dbms.*.rcp.lock.list_limit=820 Old was 520 ii.sbe7441.dbms.*.per_tx_limit = 1000 Old was 700 Other questions/ideas/concerns: - Could it be that the shared memory we require by default is not enough? Could it be that it is not enough while running veritas? - Would a maintenance job work? nethealthrc.sh.usr export NH_RESET_INGRES;NH_RESET_INGRES=yes However, just for the record, during all of these DB failures, Ingres has been restarted cleanly. 
- Could FetchDB be done so that even if a failure occurs, you can return to the previous state? (to avoid duplicates) - A central console does not poll. Could we take the following two steps: 1. Be sure that the pollers are disabled in sys/startup.cfg - saves memory/proc time. 2. That the nhFetch erases any elements in the remote poller's range? (why do we keep duplicates there). Implications of using nhFetchDb's deleteElements() with MIN_ID and MAX_ID values of 1000000, and 2000000? - Is it safer to make sure that the job of type Local Fetch (similar to Fetch DB) is made with a load of 100? What are the consequences of it being 0? - What about the deadlock? Maximum number of locks. Is it safe to modify the mentioned ingres variables? Should we make recommendations for big systems, or for special situations? Errors detected on system: System log - Rollup failure: |1000063| 592|Error |nhiMsgServer |Database error: (E_US125C Deadlock detected, your single or multi-query transaction has\n been aborted.\n (Sat Oct 13 03:13:18 2001)\n). |13-oct-2001 03:13:23 | |1000063| 593| |nhiMsgServer |Job step 'Statistics Rollup' failed (the error output was written to /opt/ehealth/log/Statistics_Rollup.100000.log Job id: 100000). Another DB error: |1000047| 3|Error |nhiMsgServer |Database error: (E_QE007D Error trying to put a record.\n (Wed Oct 10 08:39:26 2001)\n). |10-oct-2001 08:39:26 | |1000047| 4|Fatal Internal|nhiCfgServer |Call 'cdb*Op' to database API failed. (dbs/) | Error log deadlocks: SBE7441 ::[51845 , 00000022]: Sat Oct 13 03:13:14 2001 E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. 
The lock list currently holds 700 logical locks, and the maximum number of locks allowed is 700. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit. SBE7441 ::[51845 , 00000022]: Sat Oct 13 03:13:14 2001 E_DM004B_LOCK_QUOTA_EXCEEDED Lock quota exceeded. SBE7441 ::[51845 , 00000022]: Sat Oct 13 03:13:18 2001 E_DM9044_ESCALATE_DEADLOCK Deadlock encountered while escalating to table level locking on table nh_deleted_element in database nethealth with mode 5. SBE7441 ::[51845 , 00000022]: Sat Oct 13 03:13:18 2001 E_DM0042_DEADLOCK Resource deadlock. SBE7441 ::[51845 , 00000022]: Sat Oct 13 03:13:18 2001 E_QE002A_DEADLOCK Deadlock detected. Another error in ingres error.log: THREAD The SCF alert subsystem event thread has been altered. The operation code is 0 (0 = REMOVE, 1 = ADD, 2 = MODIFY). SBE7441 ::[51845 , 00000004]: Sat Oct 13 03:27:35 2001 E_SC0235_AVERAGE_ROWS On 2959. select/retrieve statements, the average row count returned was 76. Another error in ingres error.log: PUTTING_RECORD Error trying to put a record. SBE7441 ::[59859 , 00001bce]: Fri Oct 12 11:09:52 2001 E_QE007D_ERROR_PUTTING_RECORD Error trying to put a record. SBE7441 ::[59859 , 00001bd7]: Fri Oct 12 11:12:24 2001 E_CL0608_DI_BADEXTEND Error allocating disk space write() failed with operating system error 27 (File too large) SBE7441 ::[59859 , 00001bd7]: Fri Oct 12 11:12:24 2001 E_DM9000_BAD_FILE_ALLOCATE Disk file allocation error on database:nethealth table:nh_element pathname:/opt/ehealth/idb/ingres/data/default/nethealth filename:aaaaaald.t00 write() failed with operating system error 27 (File too large) Another error in ingres error.log: Maybe because of: ii.sbe7441.rcp.lock.resource_limit: 25924 -> 25924 (changed back) ::[II_RCP , 00000001]: Sat Oct 13 03:33:54 2001 E_CL121C_ME_OUT_OF_MEM MEget_pages: Can not expand memory size ME_get_shared: The request to the operating system to allocate 20455424 bytes of shared memory failed. 
This request most probably failed because the size of the allocation exceeds the maximum size shared memory segment configured for the OS. In most System V interface compliant shared memory OS implementations this maximum size is user tunable by altering the SHMSIZE parameter in the kernel configuration procedure. shmget() failed with operating system error 22 (Invalid argument) ::[II_RCP , 00000001]: Sat Oct 13 03:33:54 2001 E_DMA800_LGKINIT_GETMEM An unexpected error occurred when calling MEget_pages() to connect to the LG/LK shared memory segment. Verified the /etc/system: * eHealth Parameters start set shmsys:shminfo_shmmax=15073280 set shmsys:shminfo_shmmni=200 set shmsys:shminfo_shmseg=200 * eHealth Parameters end IPM output: (these things take about 30s each) Session Detail: Session Name: ehealth Terminal: pts/10 State: CS_EVENT_WAIT (DIO) ID: 55 Mask: Real User: ehealth Apparent User: ehealth Database: nethealth DBA: ehealth Group Id: Role Id: Server Facility: QEF (Query Execution Facility) Application code: 00000000 Activity: Log records processed: 0 Current log address: Activity Detail: Session Description: Security Label: Query: select IFNULL(max(element_id), 0) from nh_deleted_element where element_class= ~V and element_type!= ~V 10/15/2001 11:38:49 AM yzhang can you find out from customer the size of this physical file aaaaaald.t00 10/15/2001 12:50:11 PM foconnor -rwx------ 1 ingres staff 8192 Oct 15 15:59 aaaaaalc.t00 -rwx------ 1 ingres staff 523444224 Oct 15 16:17 aaaaaald.t00 -rwx------ 1 ingres staff 40960 Oct 15 16:17 aaaaaale.t00 -rwx------ 1 ingres staff 9707520 Oct 15 16:17 aaaaaalf.t00 -rwx------ 1 ingres staff 247963648 Oct 15 15:29 aaaaaalg.t00 -rwx------ 1 ingres staff 696320 Oct 15 16:16 aaaaaalh.t00 10/16/2001 10:40:59 AM yzhang Farrell, Don agreed we need to do a test on reboot, here is a brief instruction: Find 
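The two resource failures quoted in this ticket are both plain limit overruns, which can be checked with back-of-envelope arithmetic. A minimal sketch, using only numbers taken from the ticket itself (the /etc/system SHMMAX value, the 20455424-byte Ingres request, and the per_tx_limit change from 700 to 1000); this is an illustration, not a diagnostic tool:

```shell
# Values copied from the ticket logs and config.dat edits above.
REQUESTED=20455424     # bytes Ingres asked ME_get_shared for
SHMMAX=15073280        # set shmsys:shminfo_shmmax in /etc/system
HELD=700               # logical locks held when the rollup deadlocked
OLD_TX_LIMIT=700       # ii.*.rcp.lock.per_tx_limit before the change
NEW_TX_LIMIT=1000      # value after the config.dat edit

# shmget() returns EINVAL when the requested segment exceeds SHMMAX.
if [ "$REQUESTED" -gt "$SHMMAX" ]; then
  echo "shmget fails: request exceeds SHMMAX by $((REQUESTED - SHMMAX)) bytes"
fi

# A transaction already at the per-transaction lock ceiling forces the
# table-level escalation that showed up as E_DM9044/E_DM0042.
if [ "$HELD" -ge "$OLD_TX_LIMIT" ] && [ "$HELD" -lt "$NEW_TX_LIMIT" ]; then
  echo "per_tx_limit exhausted; new limit leaves $((NEW_TX_LIMIT - HELD)) locks of headroom"
fi
```

This matches the E_CL121C and E_DMA00D messages above: a roughly 19.5 MB shared-memory request cannot fit in a 15073280-byte segment, and a transaction holding 700 locks has hit the 700-lock ceiling exactly.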
two Unix machines with 4.8 installed; set one of them as remote with SERVER_ID of 2, construct the DB on the remote site, then do a fetch. After the fetch, check that the element_id in the central element table is 2xxxxxx, then reboot the central machine, and check if the element_id changed. Let me know the result. Thanks Yulun 10/16/2001 9:51:03 PM yzhang my plan for this problem is to clean the element and nh_deleted_element tables on the central site, then run rollup; if it still takes 3 days to finish, then you need to get the advanced log with debug to figure out where it hangs. To clean the tables, do the following on the central: 1) echo "copy table nh_element() into 'nh_element.dat' \g" | sql nethealth 2) echo "copy table nh_deleted_element() into 'deleted_element.dat'\g" | sql nethealth have customer keep these two data files 3) echo "modify table nh_element() to truncated\g" | sql nethealth 4) echo "modify table nh_deleted_element() to truncated\g" | sql nethealth 5) index the two tables (will send you script tomorrow) 6) do a fetch in debug mode with: sh -x nhFetchDb ........>& fetch.out Yulun 10/17/2001 11:55:51 AM yzhang here is the command for indexing the nh_element table CREATE UNIQUE INDEX nh_element_ix1 ON nh_element (element_id) WITH STRUCTURE=BTREE CREATE UNIQUE INDEX nh_element_ix2 ON nh_element (element_class, name) WITH STRUCTURE=BTREE here is the command for indexing the nh_deleted_element table CREATE UNIQUE INDEX nh_deleted_element_ix1 ON nh_deleted_element (element_id) WITH STRUCTURE=BTREE you can put everything in one script, then test the script; send me the script before you send to customer. 
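The three index commands above can be packaged into the single script Yulun asks for. A minimal sketch that writes the statements to a file which would then be fed to `sql nethealth`; the filename is my own choice, and actually running it against the database is a separate, deliberate step:

```shell
# Generate the re-index script described in the ticket. Applying it
# (e.g. `sql nethealth < reindex_tables.sql`) is left to the operator.
cat > reindex_tables.sql <<'EOF'
CREATE UNIQUE INDEX nh_element_ix1 ON nh_element (element_id)
    WITH STRUCTURE=BTREE;
CREATE UNIQUE INDEX nh_element_ix2 ON nh_element (element_class, name)
    WITH STRUCTURE=BTREE;
CREATE UNIQUE INDEX nh_deleted_element_ix1 ON nh_deleted_element (element_id)
    WITH STRUCTURE=BTREE;
EOF
# Sanity-check before sending to the customer, as Yulun requests.
echo "wrote $(grep -c 'CREATE UNIQUE INDEX' reindex_tables.sql) statements"
```

Keeping the DDL in a reviewable file, rather than typing it live into an sql session, matches the "test the script, send me the script" workflow in the message above.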
Yulun 10/17/2001 12:08:00 PM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Wednesday, October 17, 2001 11:50 AM To: 'Samuel.Schumacher@swisscom.com'; 'schorer@genesiscom.ch' Cc: O'Connor, Farrell Subject: Call ticket 55200 Sam/Patrick, Here are the complete instructions: To fix the statistic rollup problem we are going to truncate the nh_deleted_element table and the nh_element table. We have found that the nh_deleted_element table is very large and is probably the cause of the statistic rollups running so slowly. I want you to do the following on the central server: 1) Perform a database save (just a precaution) %cd $NH_HOME %source nethealthrc.csh 2) echo "copy table nh_element() into 'nh_element.dat' \g" | sql nethealth 3) echo "copy table nh_deleted_element() into 'deleted_element.dat'\g" | sql nethealth Save the nh_element.dat and the deleted_element.dat files (copy them to a safe place). 4) echo "modify table nh_element() to truncated\g" | sql nethealth 5) echo "modify table nh_deleted_element() to truncated\g" | sql nethealth I am going to send you a script for instruction #6 6) I will send a script to you to index the tables 7) Then I am going to have you do a fetch in debug mode with: sh -x nhFetchDb ........>& fetch.out 10/18/2001 8:08:39 AM foconnor Rollups ran fine last night. 10/18/2001 11:53:39 AM yzhang I saw the sde7441 remote has all the duplicates; this is why they were on the central site previously. Get the name and element id list from the central after the fetch. They need to: 1) clean the sde7441 remote site by removing any element with element_id starting with 16, then remove the '-A' from all of the elements with element_id starting with 13 2) truncate nh_element and nh_deleted_element from central, index the tables, then fetch, 3) after the fetch succeeds, they need to remove all of the element information with ids starting with 16 from all of the stats tables and element-related tables. Let's do steps 1 and 2 first. 
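The cleanup plan above sorts elements purely by the leading digits of their element_id (the 16xxxxxx and 13xxxxxx ranges named in the ticket). A hedged sketch of that triage; the sample IDs are invented for illustration:

```shell
# Classify element IDs the way the cleanup plan describes:
#   16xxxxxx -> duplicates to remove from the remote site
#   13xxxxxx -> keep, but strip the '-A' suffix from the element name
#   anything else -> leave untouched
for id in 16000123 13000045 20000001; do
  case "$id" in
    16*) echo "$id: remove (duplicate range)" ;;
    13*) echo "$id: keep, remove '-A' from name" ;;
    *)   echo "$id: untouched" ;;
  esac
done
```

The same prefix test is what a removal query would encode in its WHERE clause; doing a dry-run listing like this first gives the operator a reviewable plan before touching the stats tables.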
Thanks Yulun 10/19/2001 10:33:07 AM yzhang waiting for response from customer regarding the duplicates on the central site 10/22/2001 12:38:44 PM foconnor Statistic Rollups are working 10/22/2001 2:09:35 PM tbailey changed to fixed, de-escalated 10/18/2001 11:23:27 AM smoran Summary: we had a serious ingres database error last week on Oct 2nd. Due to that error the eHealth server stopped and we lost data for a couple of days. Currently the system is up and running but we are still interested in the reason for the crash. We have added the system log and the errlog.log where some error messages can be found. Checked errlog ::[II_RCP , 00000001]: Tue Oct 2 09:46:50 2001 E_DM9673_DMVE_ALLOC_MAPSTATE Consistency Check: During recovery processing of database nethealth, an Allocate log record was encountered for table (iiattribute, $ingres) in which the allocated page state was not consistent with the Allocate log record. The Free Map (page number 129) lists the allocated page state as USED when the log record indicates that it should be FREE. The allocated page is page number 668 and its page status is 00000020. Since the recovery action is to set the page state to FREE and it is already described as so, the condition is not considered fatal and recovery continues. Please report this occurrence to Computer Associates Technical Support and check the consistency of the above table using Verifydb. ::[II_RCP , 00000001]: Tue Oct 2 09:46:51 2001 E_DM964D_REDO_ALLOC Error redoing an allocate free page operation. nhstat11::[ingres , 000001d2]: Tue Oct 2 09:46:44 2001 E_CL2530_CS_PARAM sec_label_cache = 100 Checked system messages No server stop, but there is a server start Thursday, October 11, 2001 7:07:38 AM shagar Talked to Farrell; never seen these messages before. Should bug them to have engineering take a look. Collect nhCollectCustData 10/23/2001 11:29:01 AM cestep Output of a sysmod: Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . 
E_US1208 Duplicate records were found. (Tue Oct 2 09:47:15 2001) Sysmod of database 'nethealth' abnormally terminated. 11/5/2001 9:12:54 AM cestep The database was inconsistent. After destroy, create and load, the error was eliminated. 10/18/2001 11:28:00 AM smoran ================================================================================================== Summary: Database backup jobs started within the Console or the scheduler are failing with the following error in the log file: ... Unloading table nh_rlp_boundary . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_bsln_info . . . no need for spliting Dac tables if nh_rpt_config is empty Unload of database 'nethealth' for user 'ehealth' completed successfully. Error: File not found. Error: File not found. Error: The program nhiSaveDb failed. We have uploaded the save.log and the system log to ftp.concord.com/incoming/izb1.zip Messages that user ehealth modified the poller configuration are plentiful before the save failed. This is the only occurrence of the database save failing. Thursday, October 11, 2001 7:07:43 AM shagar Talked to Farrell; never seen these messages before. Should bug them to have engineering take a look. Collect nhCollectCustData 10/24/2001 2:17:54 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, October 24, 2001 2:09 PM To: 'Thomas Dirsch' Subject: RE: Ticket # 55044 - nhSaveDb failure Thomas, It appears that the iiattribute table became corrupt. At this time I have logged a bug on your behalf in order for development to examine the cause. 
However, from the message in the ingres errlog.log it appears that the problem is: ::[II_RCP , 00000004]: ULE_FORMAT: Couldn't look up message 3a415 (reason: ER error 10903) E_CL0903_ER_BADPARAM Bad parameter ULE_FORMAT: Couldn't look up system error (reason: ER error 10902) E_CL0902_ER_NOT_FOUND No text found for message identifier Which may mean we have a non-recognized error code which we must correct. Sincerely, Walter 3/20/2002 8:10:34 AM dbrooks close per robin trei. 10/18/2001 3:03:07 PM rrick Reason: Customer originally ran out of space on disk-1. Added a new disk and needs to save off data and re-install NH. Problem: nhSaveDb is failing with the following error: Unloading table nh_job_step . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_job_step () INTO 'E:/nethealthdb.tdb/njs_b47'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Fri Sep 21 09:18:35 2001) ). (cdb/DuTable::saveTable) History: 1. Had customer remove and rebuild this table. 2. When they tried to run nhSaveDb again they received: Please see error in bafs/escalated tickets/54000/54306/nhSaveDb.doc 3. Had customer remove table and rebuild this table. 4. When customer tried to run nhSaveDb again they received: Unloading table nh_job_step . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_job_step () INTO 'E:/nethealthdb.tdb/njs_b47'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Fri Sep 21 09:18:35 2001) ). (cdb/DuTable::saveTable) 10/23/2001 12:06:36 PM yzhang copy dmt_show.sh from yzhang/scripts, and send to this customer. 
run it as nhuser after doing the source, have him mkdir $NH_HOME/test_dmt, then copy the script into this directory, and type the script name to run it Yulun 10/24/2001 3:10:25 PM rrick -----Original Message----- From: Edmond Lee [mailto:elee@fedcom.com] Sent: Tuesday, October 23, 2001 5:27 PM To: 'Rick, Russell ' Subject: RE: Ticket #54306 Hi Russ, Please see result below: Thanks -Edmond error (Tue Oct 23 14:29:13 2001) E_DM010B An error occurred while showing information about a table. (Tue Oct 23 14:29:13 2001) E_SC0206 An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Tue Oct 23 14:29:13 2001) * * Ingres Version II 2.0/9808 (int.wnt/00) logout Tue Oct 23 14:29:13 2001 INGRES COPYDB Copyright (c) 1987, 1998 Computer Associates Intl, Inc. Unload directory is 'D:\Nethealth\test_dmt'. Reload directory is 'D:\Nethealth\test_dmt'. There are 0 tables owned by user 'ingres'. There are 0 views owned by user 'ingres'. E_XF0019 There was a table or view specified on the command line that does not exist or is not owned by you. INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Tue Oct 23 14:29:17 2001 continue * * * * go * * set autocommit on Executing . . . continue * * set lockmode session where readlock=nolock Executing . . . continue * go * * set session with privileges=all Executing . . . continue * Ingres Version II 2.0/9808 (int.wnt/00) logout Tue Oct 23 14:29:17 2001 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Tue Oct 23 14:29:18 2001 E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Tue Oct 23 14:29:18 2001) E_DM010B An error occurred while showing information about a table. 
(Tue Oct 23 14:29:18 2001) E_SC0206 An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Tue Oct 23 14:29:18 2001) * * Ingres Version II 2.0/9808 (int.wnt/00) logout Tue Oct 23 14:29:18 2001 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Tue Oct 23 14:29:19 2001 continue * * * * go * * set autocommit on Executing . . . continue * * set nojournaling Executing . . . continue * go * * set session with privileges=all Executing . . . continue * Ingres Version II 2.0/9808 (int.wnt/00) logout Tue Oct 23 14:29:19 2001 D:\Nethealth\test_dmt> -----Original Message----- From: Rick, Russell Sent: Tuesday, October 23, 2001 7:58 PM To: Zhang, Yulun Subject: FW: Ticket #54306 What do you recommend? -----Original Message----- From: Zhang, Yulun Sent: Tuesday, October 23, 2001 8:09 PM To: Rick, Russell Subject: RE: Ticket #54306 the script fails. What kind of operations have you done with this table? A regular drop? verifydb drop table? select * from nh_job_step? help table nh_job_step? Did all of them fail, or did some of them succeed? Thanks Yulun -----Original Message----- From: Rick, Russell Sent: Wednesday, October 24, 2001 2:59 PM To: Zhang, Yulun Subject: RE: Ticket #54306 They all failed. - Russ 10/29/2001 11:20:17 AM yzhang the table.out file is empty. I still believe customer can do verifydb drop table and asc copy out. 
Can you have the customer do the following step by step manually from the command line, and let me know the result for each step: 1) login as nhuser and source 2) create dir test_again 3) cd to test_again 4) copydb -c $NH_RDBMS_NAME nh_job_step 5) sql $NH_RDBMS_NAME < copy.out 6) verifydb -mrun -sdbname nethealth -odrop_table nh_job_step (look for iivdb.log) 7) sql $NH_RDBMS_NAME < copy.in Thanks Yulun 10/29/2001 4:09:48 PM rrick -----Original Message----- From: Rick, Russell Sent: Monday, October 29, 2001 4:01 PM To: 'elee@fedcom.com' Subject: RE: Ticket #54306 Edmund, Please execute the following: 1) Login as nethealth user. Any errors? 2) Create a directory called "test_again" Any errors? 3) cd to the "test_again" directory Any errors? 4) Login as the ingres user. Any errors? 5) cd to $NH_HOME\oping\ingres\bin Any errors? 6) Execute the following command: copydb -c $NH_RDBMS_NAME nh_job_step Any errors? 7) Execute the following command: sql $NH_RDBMS_NAME < copy.out Any errors? 8) Execute the following command: verifydb -mrun -sdbname nethealth -odrop_table nh_job_step Any errors? 9) cd $NH_HOME\oping\ingres\files 10) Look for iivdb.log.......save a copy to an email for me 11) cd $NH_HOME\oping\ingres\bin 12) Execute the following command: sql $NH_RDBMS_NAME < copy.in 13) cd to $NH_HOME 14) Login as the nethealth user. 15) Please forward the file I mentioned above to support@concord.com, Attn: Russ Rick - Russ 11/2/2001 11:23:16 AM yzhang Russell, Have you done the steps as I mentioned last time? Let me know which step has the problem. 
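The steps above rely on `$NH_RDBMS_NAME` being expanded by the shell, which a plain Windows command prompt will not do (the customer's later output shows `copydb` receiving the literal string `'$nh_rdbms_name'`). A defensive sketch that fails fast when the variable is missing; the variable name is from the ticket, the guard function itself is my addition:

```shell
# Refuse to build the copydb command line with an empty database name;
# otherwise copydb is handed the literal, unexpanded variable text.
check_db_name() {
  if [ -z "${NH_RDBMS_NAME:-}" ]; then
    echo "NH_RDBMS_NAME is not set - source nethealthrc first"
    return 1
  fi
  echo "would run: copydb -c $NH_RDBMS_NAME nh_job_step"
}
```

With `NH_RDBMS_NAME=nethealth` set, the function prints the copydb command it would run; with the variable unset it prints an explanatory message and returns nonzero, instead of letting copydb fail with "Database does not exist".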
The other thing you can try is to run query: select file_name from iifile_info where table_name = 'nh_job_step' let me know if the file exist, Thanks Yulun 11/8/2001 6:43:03 PM rrick -----Original Message----- From: Edmond Lee [mailto:elee@fedcom.com] Sent: Thursday, November 08, 2001 6:09 PM To: 'Rick, Russell ' Subject: RE: Ticket #54306 Hi Russ, Please see result as you request -Thanks, Edmond Please execute the following: 1) Login as nethealth user. Any errors? ok 2) Create a directory called "test_again" Any errors? 3) cd to the "test_again" directory Any errors? ok 4) Login as the ingres user. Any errors? ok 5) cd to $NH_HOME\oping\ingres\bin Any errors? ok 6) Execute the following command: copydb -c $NH_RDBMS_NAME nh_job_step Any errors? Yes, D:\Nethealth\oping\ingres\bin>copydb -c $NH_RDBMS_NAME nh_job_step INGRES COPYDB Copyright (c) 1987, 1998 Computer Associates Intl, Inc. E_US0010 Database does not exist: '$nh_rdbms_name'. (Thu Nov 08 11:29:06 2001) 7) Execute the following command: sql $NH_RDBMS_NAME < copy.out Any errors? Yes, D:\Nethealth\oping\ingres\bin>sql $NH_RDMBS_NAME < copy.out The system cannot find the file specified. D:\Nethealth\oping\ingres\bin>cd.. D:\Nethealth\oping\ingres>cd.. D:\Nethealth\oping>cd.. D:\Nethealth>$NH_RDBMS_NAME < copy.out The name specified is not recognized as an internal or external command, operable program or batch file. 8) Execute the following command: verifydb -mrun -sdbname nethealth -odrop_table nh_job_step Any errors? Yes, D:\Nethealth>verifydb -mrun -sdbname nethealth -odrop_table nh_job_step S_DU04C4_DROPPING_TABLE VERIFYDB: beginning the drop of table nh_job_step from database nethealth. E_DU501A_CANT_CONNECT Unable to connect with database nethealth. E_DU5024_TBL_DROP_ERR Unable to destroy table nh_job_step from database nethealth. 
D:\Nethealth\oping\ingres\bin>verifydb -mrun -sdname nethealth -odrop_table nh_j ob_step nh_job_step E_DU5002_INVALID_SCOPE_FLAG INVALID SCOPE FLAG: dname E_DU5007_SPECIFY_ALL_FLAGS VERIFYDB must be evoked with -m, -s and -o flags: VERIFYDB -ModeXXX -ScopeYYY -OperationZZZ <-uUSERNAME> <-nolog> where: XXX = REPORT, RUN, RUNSILENT, RUNINTERACTIVE YYY = DBNAME (followed by list of up to 10 names in double quotes), DBA, INSTALLATION ZZZ = DBMS_CATALOGS, FORCE_CONSISTENT, DROP_TABLE, PURGE, TEMP_PURGE, EXPIRED_PURGE, TABLE, XTABLE, ACCESSCHECK <> denotes optional flag. 9) cd $NH_HOME\oping\ingres\files 10)Look for iivdb.log.......save a copy to an email for me See attachment 11) cd $NH_HOME\oping\ingres\bin 12) Execute the following command: sql $NH_RDBMS_NAME < copy.in D:\Nethealth\oping\ingres\bin>sql $NH_RDBMS_NAME < copy.in The system cannot find the file specified. 13) cd to $NH_HOME 14) Login as the nethealth user. 15) Please forward the file I mentioned above to support@concord.com, Attn: Russ Rick - Russ -----Original Message----- From: Zhang, Yulun Sent: Thursday, November 08, 2001 6:27 PM To: Rick, Russell Subject: RE: 18631 step 5 and all steps after 5 need to be done by nhuser -----Original Message----- From: Rick, Russell Sent: Thursday, November 08, 2001 6:32 PM To: 'Edmond Lee' Subject: RE: Ticket #54306 Hi Edmund, Please execute step 5 and all steps after 5 need to be done by nhuser. Thanks again, - Russ 11/15/2001 1:49:01 PM mfintonis update from Tony Piergallini: Russ contacted the customer on 11/14/01. Currently we are waiting for a customer response. 11/15/2001 4:49:36 PM rrick -----Original Message----- From: Rick, Russell Sent: Thursday, November 15, 2001 03:01 PM To: Fintonis, Melissa Cc: Gray, Don; Wickham, Mark Subject: RE: 18631 Melissa, The following the email I sent the customer today to try to complete Yulun's last exercise: Hopefully this will help. 
Dear Edmund, In an effort to resolve your call ticket, I have made 4 attempts to have you complete the last exercise sent to you on Thursday, Nov. 8th, 2001. Below are the dates I have contacted you about completing this task so we can move forward to try to resolve your issue: Thursday, November 08, 2001 6:09 PM Thursday, November 08, 2001 6:27 PM Thursday, November 08, 2001 6:32 PM Wednesday, November 14, 2001 12:58:26 PM This issue is currently set as an Escalated Critical issue. Therefore, if I do not hear back from you by the end of the day today, Thursday, Nov. 15th, 2001, then I will have to de-escalate this issue and close your call ticket. You can open a new call ticket with Concord Communications by calling Technical Support at 1-888-832-4340 if you believe the incident needs further work. Regards, Russell K. Rick, Senior Support Engineer 11/20/2001 11:54:02 AM yzhang problem solved 10/19/2001 10:07:10 AM dwaterson 4.7.2, P3, D7 Windows NT SP5 Issue: Statistic Rollups keep failing with the following pattern - Scheduled Calculate Baseline job - After this appears to finish - errors are written into the ingres error log - Then the scheduled rollup runs and fails with the following error: Sql Error occured during operation (E_QE008A Error trying to destroy a table. (Thu Oct 04 19:00:35 2001) Note: many database saves give the following warnings: Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_infzip warning: error deletingE:/nethealth/db/save/Friday/friday.tdb/nh_stats0_1001836799 o . . . *************************************** Job started by Scheduler at '4/10/2001 19:00:13'. ----- $NH_HOME/bin/sys/nhiCalcBaseline ----- Scheduled Job ended at '4/10/2001 19:00:20'. 
----- ******************************************* Excerpt from error log: 0000016d Thu Oct 04 19:00:35 2001 E_CL060D_DI_BADDELETE Error deleting file or directory delete() failed with operating system error 32 (The process cannot access the file because it is being used by another process.) 0000016d Thu Oct 04 19:00:35 2001 E_DM9003_BAD_FILE_DELETE Disk file delete error on database:nethealth table: pathname:e:\nethealth\oping\ingres\work\default\nethealth filename:ppppbidj.t00 delete() failed with operating system error 32 (The process cannot access the file because it is being used by another process.) 0000016d Thu Oct 04 19:00:35 2001 E_DM9290_DM2F_DELETE_ERROR Error occurred while attempting to delete a file. 0000016d Thu Oct 04 19:00:35 2001 E_DM9340_DM2F_RELEASE_ERROR Error releasing a File Control Block. 0000016d Thu Oct 04 19:00:35 2001 E_DM9270_RELEASE_TCB Error occurred releasing a TCB. ZWART ::[II\INGRES\144 , 000000a6]: Fri Oct 05 07:20:40 2001 E_GC0001_ASSOC_FAIL Association failure: partner abruptly released association *********************************** ----- Job started by Scheduler at '4/10/2001 20:00:13'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (4/10/2001 20:00:13). Error: Sql Error occured during operation (E_QE008A Error trying to destroy a table. (Thu Oct 04 19:00:35 2001) ). ----- Scheduled Job ended at '4/10/2001 20:00:35'. **************************** Excerpt from the database save (Note: many of the database saves are giving the nh_import_poll_infzip warning referencing different tables.) ---- Job started by Scheduler at '5/10/2001 04:00:14'. ----- ----- $NH_HOME/bin/sys/nhiSaveDb -u $NH_USER -d $NH_RDBMS_NAME -p E:/nethealth/db/save/Friday/friday.tdb ----- Begin processing (5/10/2001 04:00:14). Copying relevant files (5/10/2001 04:00:16). Unloading the data into the files, in directory: 'E:/nethealth/db/save/Friday/friday.tdb/'. . . Unloading table nh_stats_poll_info . . . 
Unloading table nh_import_poll_infzip warning: error deletingE:/nethealth/db/save/Friday/friday.tdb/nh_stats0_1001836799 o . . . Unload of database 'nethealth' for user 'neth' completed successfully. End processing (5/10/2001 04:07:35). ----- Scheduled Job ended at '5/10/2001 04:07:35'. ----- ******************************* The customer uses 2100 element licenses, but has about 10000 database elements in the poller configuration. The Statistic Rollup job is set to: As polled: 8 days 1 hour samples: 6 weeks 1 day samples: 70 weeks Following information on bafs:\\54000\54890 - 'Advanced Logging...' a) Configuration b) Messaging c) Database Rollup - database saves. - Output from the command: echo help table *\g | sql nethealth > c:\temp\tables.out - winmsd to provide the System Resources 10/22/2001 10:51:21 AM wburke -----Original Message----- From: Burke, Walter Sent: Monday, October 22, 2001 10:43 AM To: 'Herman Van den Broeck'; support Subject: RE: CC54890 Herman, Please run the following: nhServer stop verifydb -mreport -sdbname nethealth -odbms_catalogs Send output and $NH_HOME/oping/ingres/files/iivdb.log then run sysmod nethealth, capture output, send. Sincerely, 10/30/2001 11:29:01 AM wburke Both sysmod and verifydb ran successfully. 11/5/2001 4:11:24 PM yzhang 1) see if customer can manually delete this file: e:\nethealth\oping\ingres\work\default\nethealth filename:ppppedkc.t00 2) echo "select table_name from iifile_info where file_name = 'ppppedkc.t00'\g" | sql nethealth 3) check the current rollup status with the customer; find out what current rollup problem they have. 
Thanks Yulun 11/5/2001 4:12:45 PM wburke -----Original Message----- From: Zhang, Yulun Sent: Monday, November 05, 2001 4:03 PM To: Burke, Walter Subject: 18658 1) see if customer can manually delete this file: e:\nethealth\oping\ingres\work\default\nethealth filename:ppppedkc.t00 2) echo "select table_name from iifile_info where file_name = 'ppppedkc.t00'\g" | sql nethealth 3) check the current rollup status from the customer, find out what's the current rollup problem they have. Thanks 11/5/2001 4:16:59 PM yzhang Customer agreed to close the ticket; the problem no longer exists. 10/31/2001 11:32:27 AM foconnor Distributed polling. Central Server HPUX - hostname toilet NH 4.8 Patch 6 D05 Server id 1 Remote Poller Solaris 2.8 - hostname bulldozer NH 4.8 Patch 6 D05 Server id 20 Yulun ran test: Central had duplicates -> all elements have a 1,000,000 entry with no data and a 20,000,000 entry with a "-A". Central is not polling the same elements. Remote has no duplicates and all the elements are in the 20,000,000 range. Yulun removed all the 1,000,000 id range elements with an sql statement - verified they were gone. nhReset (with the NH_RESET_INGRES variable set), nhServer stopped, ingres reset and the nhServer restarted successfully. Yulun ran the query in a sql session: select name, element_id from nh_element order by name\g And there were duplicates again in the 1,000,000 range. Looks like it is getting built by reading the poller.cfg first. We have recently had several customers that are experiencing duplicates on distributed polling sites. 11/1/2001 1:47:14 PM yzhang This is a procedural error, update to nobug. We did see duplicates on the central site in the following two situations: 1) let's say that we have a successful fetch on the central site, then do nhDestroyDb, nhCreateDb, start nhServer; this will cause elements in the 1,000,000 range to be populated into the nh_element table. 
The way to correct this is to overwrite poller.cfg with poller.init after nhDestroyDb and before nhCreateDb. So we need extra precaution when cycling the database on the central site. 2) if we run a query to truncate or delete the element table on the central site, we only remove the element information from the database, but the same information is still in poller.cfg; if the server is started after running the truncate or delete query, the same element info will be populated back into the element table. The way to correct this is that every time we want to remove elements from the database on the central site, we need to run the nhDeleteElement executable, which removes the element info from both the database and the config file. Don't run truncate or delete queries. The scheduled maintenance job does the same thing as stopping and starting nhServer, so the above situations apply to the maintenance job as well. This means if you run a truncate or delete query on Monday, you will see duplicate elements the next Monday (assuming the maintenance job is run on Sunday). Thanks 10/31/2001 3:34:37 PM dblodgett Problem: - customer was running nhDbStatus periodically on a server in order to monitor the database liveness, but - when nhDbStatus is run it puts table locks on the database, - if another process is using a table then nhDbStatus may disconnect that other process from the database, - this can cause the database to go inconsistent, - necessitating the recovery procedure - this has been replicated by Andrew Messer in the field, also this has been recognized by Senior Technical Support personnel Current work around: we recommend that the customer not use nhDbStatus for monitoring the database; we recommended he monitor file size using system commands 11/1/2001 10:40:25 AM jpoblete Customer stopped using nhDbStatus to monitor the Db, and it stopped going inconsistent. 11/1/2001 1:43:10 PM yzhang Jose, The customer has deadlocks on iirelation. The sysmod should clear up the deadlocks on iirelation. It does in most cases. 
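The duplicate-element mechanism described in the 11/1/2001 "procedural error" update above can be sketched in a few lines of Python. This is a toy model, not Concord code — the element names, id values, and the restart_server helper are invented for illustration. It only shows why deleting rows from nh_element while poller.cfg still lists the elements brings the 1,000,000-range entries back:

```python
def restart_server(rows, poller_cfg_names, next_local_id=1000000):
    """Toy model of server startup: names present in poller.cfg but absent
    from the element table are re-inserted with locally assigned ids in
    the 1,000,000 range."""
    out = set(rows)
    known = {name for name, _ in rows}
    for offset, name in enumerate(sorted(poller_cfg_names - known), start=1):
        out.add((name, next_local_id + offset))
    return out

# The element table was truncated, but poller.cfg still lists the element,
# so a restart re-creates it with a local (1,000,000-range) id ...
rows = restart_server(set(), {"router-a"})
# ... and the next fetch merges in the remote-assigned (20,000,000-range) row:
rows |= {("router-a", 20000001)}

names = [name for name, _ in rows]
assert names.count("router-a") == 2  # duplicate entries, one per id range
```

Overwriting poller.cfg with poller.init before nhCreateDb, or removing elements with nhDeleteElement, keeps the two sources in sync and avoids the duplicate.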
The important thing you need to stress to the customer is that optimizedb and sysmod should be performed on a regular basis to keep Ingres running at peak performance: login as nhuser, source nethealthrc, then optimizedb dbname > optimize.out and sysmod dbname > sysmod.out Thanks Yulun 11/2/2001 9:50:36 AM jpoblete Yulun, I passed your instructions to the customer, but they asked to close out the call ticket. I'll wait until Monday to see if they come back to me. -JMP 11/2/2001 11:06:26 AM yzhang The problem no longer exists; ticket closed. 10/31/2001 4:55:29 PM wburke nhiDialogRollup appears to fail silently. Running via the command line, it begins and ends with no error in advanced logging; however, the log shows that the process abruptly ends. help\g on the dB shows NO dlg1s tables. BAFS/55638/ for - adv.logging - dbStatus - Tables.out Spoke with both Robin and Brad. This probably indicates a blown stack on the nhiDialogRollup.exe. Will try to get B.Keville to help me build, otherwise need to escalate to have eng build for the customer. 11/1/2001 4:58:46 PM wburke Robin, I was able to build (with Yulun's help) the new stack size for the issue. I will let you know if it worked. 11/2/2001 11:56:31 AM wburke -----Original Message----- From: Dan Kaskel [mailto:allied62@yahoo.com] Sent: Friday, November 02, 2001 11:01 AM To: Burke, Walter Subject: Re: Ticket # 55638 - nhiDialogRollup Fails Walter, Rollup completed successfully. Thanks for your help. Dan 11/9/2001 11:44:10 AM yzhang problem solved 11/1/2001 4:33:00 PM tfuller Customer's maintenance fails after 'iirelation' fails. Customer has had multiple instances of his database going inconsistent. Also states that eHealth is using all 2GB of memory on his box. Belief is that it is related to Ticket# 15591 "Memory leak on HP 11" Runs scheduled maintenance Sunday, Tuesday and Thursday at 9am CST Maintenance log shows: Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . Modifying 'iidevices' . . . Modifying 'iiindex' . . . 
Modifying 'iirelation' . . . E_US1200 Table name is not valid. (Tue Oct 30 10:03:08 2001) Sysmod of database 'nethealth' abnormally terminated. Starting Network Health servers. This is a very critical customer situation. Support believes that the database should be destroyed, created, and reloaded from a save. The issue we are facing is that we believe, due to the customer's past history and known issues such as a memory leak, we will only be delaying another issue with an already irate customer. We would like a total solution for the issues that are being seen so that we can give the customer some assurance that the latest group of fixes will be the last he will need for a while. 11/2/2001 12:00:34 PM yzhang have him do a quick query as nhuser: help table iirelation > relation.out help table iiattribute > attribut.out 11/2/2001 12:15:15 PM yzhang have him do a quick query as nhuser: help table iirelation > relation.out help table iiattribute > attribut.out We have a customer whose Ingres has been completely down due to an error reading the Ingres system catalog; the following is the last piece of error.log. The problem seems to start with: Incorrect count of key attributes for table $ingres.iiridxtemp database nethealth. When I had the customer run sysmod, they got the following output: Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . Modifying 'iidevices' . . . Modifying 'iiindex' . . . Modifying 'iirelation' . . . It looks like the system catalog has been corrupted. Why is this happening, and how do we correct it? 
Thanks Yulun 11/2/2001 2:56:40 PM yzhang Thomas, Can you have the customer do the following: 1) back up the database 2) as ingres, ingstop (or ingstop -force) 3) ingstart 4) verifydb -mreport -sdbname "nethealth" -odbms_catalogs; the result of this will be appended to iivdb.log, send me the iivdb.log Thanks Yulun 11/2/2001 4:29:52 PM yzhang Here is the verifydb output: I think what our client should do is: login as ingres, then VERIFYDB -mruninteractive -sdbname iidbdb -odrop_table iiridxtemp My question is which db has the table iiridxtemp. Is it iidbdb? Is iiridxtemp a temporary table? Why is the number of columns in iiridxtemp mismatched between iirelation and iiattribute? Thanks Yulun 11/2/2001 5:21:06 PM yzhang Unfortunately, the client's database has no checkpoint and no journal. That means there is no way they can rollforward, right? And the only thing they can do is savedb, destroy, create and reload, right? Is there any other way besides this? 11/5/2001 9:41:48 AM yzhang The customer's database system catalog has been corrupted, and there is no way to rollforwarddb because they have no checkpoint save. They need to recycle the database by doing nhSaveDb, destroydb, createdb and loadDb; please write the instructions for this customer, and make sure they have a successful db save before destroying the database. Thanks Yulun 11/5/2001 2:16:24 PM yzhang Chunk, Your database system catalog has been corrupted, and there is no way for you to rollforwarddb because there is no checkpoint save and no journalling. Now what you can do is: 1) nhSaveDb (back up the current database) 2) nhDestroyDb nethealth 3) nhCreateDb nethealth 4) nhLoadDb Let me know if you can handle this, or if you need more information regarding how to do these. 
Make sure the db save is successful before you run nhDestroyDb. Thanks Yulun 11/5/2001 5:53:32 PM yzhang The only thing you can do now is to recycle the database. You can start doing the db save now, and I don't think the db save will be a problem; after the save finishes, send me the save.log. After the recycle, I will have you do the checkpoint save and journal the database, as well as optimize the database frequently, which will prevent further system catalog corruption. 11/7/2001 12:06:33 PM apier De-escalated per daily bug meeting. Waiting for Yulun to provide exact steps for capturing the root cause of this problem if it occurs again. 11/8/2001 6:20:03 PM yzhang Ravindra, Our customer decided to recycle the database again, but they said they need a solution, or a better approach than recycling the database, if the system catalog corruption occurs again. Now here are my questions: 1) what are we going to do to prevent further corruption? 2) what information do you need me to collect from our customer for you to research the problem and recover the database without recycling it next time? I would like to do all the preparation we can to guard against further corruption. What is in my mind now is that after this recycle they need to run optimizedb and sysmod frequently, and they also need to do ckpdb +j so that rollforwarddb can be used for the next corruption. Can you reply as soon as possible, including commands and details, so I can instruct our customer. Thanks 11/12/2001 9:39:42 AM yzhang The db load is fine; ignore the warning or error messages in the load.log. The errors in the errlog are from the database recycle. You can ignore them too, since your system has come up and is running. Let us know if you encounter a problem. Thanks Yulun 11/13/2001 1:46:26 PM yzhang customer did upgrade to 5.0.1 11/2/2001 6:05:22 AM foconnor Customer is complaining that some fetches are taking 22 minutes while other fetches are taking 4 to 5 minutes. 
The customer performs fetches every hour, and they feel 22 minutes per fetch is intolerable because it limits reporting from the web: web reports cannot be run while fetches are taking place. One Central, 2 remotes, ~26000 elements. Customer has experienced several issues with this distributed polling environment including statistic rollup problems (fixed), duplicates (fixed, but may still be getting them) and fetch issues. 11/7/2001 1:06:32 PM mwickham This problem ticket is being escalated due to customer sensitivity and potential revenue impact ($300-400K in 2002). We need to know if Network Health is performing as designed by taking 22 minutes to complete the fetch/merge. If it is working as designed, please provide feedback to Support so we can explain it properly to the customer and stand behind the product. If specific information from the customer's site would be helpful in forming an Engineering answer, please let us know. 11/7/2001 3:22:44 PM yzhang I reviewed all of the available information; here is what I think the problem is: The major problem this customer has is that the nh_element, nh_elem_assoc, nh_deleted_elem and nh_elem_alias tables are polluted with a lot of elements whose element_id is in the 1,000,000 range (for example, nh_elem_alias has more than 10000 elements with ids in the 1,000,000 range). On the central site these tables are not supposed to have elements with ids in the 1,000,000 range. The first thing they need to do is clean up (through nhDeleteElement) each of the tables and the poller.cfg. Sheldon, I think Farrell has left for the day. Can you have somebody in Support practice using nhDeleteElement with an element_id range as the argument, then send me the command? I want to see the command before it goes to the customer. 
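The cleanup requested above — select the names of elements whose ids fall in the duplicate 1,000,000 range, then feed that list to nhDeleteElement — reduces to a simple filter. A hypothetical Python sketch with invented table data (the real query runs through sql nethealth, and the tool expects one element name per line in its input file):

```python
# Invented sample of nh_element rows: (name, element_id)
elements = [
    ("router-a", 1000001),   # locally assigned duplicate (1,000,000 range)
    ("router-a", 20000001),  # remote-assigned entry (20,000,000 range)
    ("switch-b", 20000002),
]

# Names whose ids fall in the duplicate range, as a SQL "between" would
# select them:
to_delete = sorted({name for name, eid in elements
                    if 1000000 <= eid <= 2000000})
assert to_delete == ["router-a"]

# One name per line, the format a name-list input file needs:
infile = "\n".join(to_delete) + "\n"
assert infile == "router-a\n"
```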
Thanks Yulun 11/7/2001 5:12:40 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Wednesday, November 07, 2001 05:00 PM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: FW: escalated 18957 Yulun, This is what we were thinking: echo "select name from nh_element where element_id > 1000000\g " | nhiSql nethealth > file.out nhDeleteElements -inFile file.out What do you think of this approach? - Mark 11/7/2001 5:22:51 PM yzhang Use this: echo "select name from nh_element where element_id between 1000000 and 2000000\g" | sql nethealth > file.out nhDeleteElements -inFile file.out If this does not work you need to format file.out so that only element names are listed in the file. If it works, you need to check whether the same element information has been removed from the assoc, alias and deleted_element tables. Do a test until you are confident; I still want to see the command before you send it to the customer. Thanks Yulun 11/8/2001 11:17:19 AM yzhang Grab /export/sulfur1/nh48/bin/nhfetchDb_timestamp, back up the one they have, and change the permissions on the new one. Run the following command: nhfetchDb_timestamp > nhfetchDb_timestamp.out 11/8/2001 1:32:17 PM foconnor Sent directions to download nhFetchDb_timestamp 11/9/2001 2:45:22 PM yzhang Everything looks fine; the major time is spent on inserting into the element table, which takes about 17 minutes. For inserting 30,000 elements with duplicate checking, this is a very reasonable time. I recommend closing this ticket. Yulun 11/12/2001 8:06:15 AM foconnor -----Original Message----- From: Patrick Schorer [mailto:schorer@genesiscom.ch] Sent: Monday, November 12, 2001 7:29 AM To: O'Connor, Farrell Subject: RE: Call ticket 54899 Importance: High Hi Farrell Thanks for your answer. Have you any idea why the fetches take more time without duplicates? What is the difference when we have duplicates? I think there is something different, but the fetches take 5 minutes WITH duplicates and 22 minutes WITHOUT duplicates. 
This is a little bit confusing. When the fetch job takes 22 minutes WITH duplicates and 5 minutes without duplicates, then I would say this is "normal". Do you have any explanation for that? Thanks and best regards Patrick 11/12/2001 12:29:09 PM yzhang First I want to verify it only takes 4 min for 30,000 elements. Check with the customer to see if they want to do the following: 1) disable nhFetch from central 2) save db on the central site 3) destroy, create and reload (overwrite poller.cfg with poller.init before createDb) 4) run a manual fetch with nhFetchDb_timestamp located in /export/sulfur1/nh48/bin/nhFetchDb_timestamp, and send the debug output. Get nhFetchDb_timestamp again, because I modified it since last time. If the customer doesn't want to do this, I suggest you do a fetch test under a situation similar to the customer's, especially making sure you have about 30,000 elements. Thanks Yulun 11/8/2001 3:50:06 PM jnormandin Problem: Customer is experiencing statistics rollup errors due to duplicate keys: Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. A dump of all nethealth tables (help\g) indicates the earliest stats0 table to be from May 11, 2001. The customer did not notice this failure until now, as his disk space is reaching capacity (15 gig drive). The rollup error references an index failure, but all of the stats0 tables do in fact have indexes (1 and 2). I have attempted the following: 1- Ran nhiIndexDiag. - Returned no problem stats tables, but made reference to several others regarding btree and HEAP failures: Problem encountered with analyzing table nh_daily_exceptions Error: Indexing problem: nh_daily_exceptions should have been btree but was HEAP. Problem encountered with analyzing table nh_daily_health Error: Indexing problem: nh_daily_health should have been btree but was HEAP. 
Problem encountered with analyzing table nh_daily_symbol Error: Indexing problem: nh_daily_symbol should have been btree but was HEAP. Problem encountered with analyzing table nh_hourly_health Error: Indexing problem: nh_hourly_health should have been btree but was HEAP. Problem encountered with analyzing table nh_hourly_volume Error: Indexing problem: nh_hourly_volume should have been btree but was HEAP. Table nh_job_schedule is lacking an index. Duplicate problem: Found 0 duplicates out of 45 rows for index job_schedule_ix on table nh_job_schedule. Analysis of indexes on database 'nethealth' for user 'netview' completed successfully. - I received the same info on my system, so I am assuming these to be benign. 2- Ran the cleanstats.sh script (without and with the clean argument) - returned no duplicate stats tables. 3- Attempted to incrementally roll up the db (one day at a time) via a script from May 15 to November 5 - The script ran successfully for May 15 to May 18 and then failed indicating the same duplicate key error for the May 19 rollup - The odd thing is that no stats0 tables were actually rolled up, as the help\g output indicated the same earliest stats0 table. - The rollup also oddly reclaimed disk space (1 gig or so). 4- Attempted to re-index the database via the nhiIndexDb command. - This was successful: C:\>nhiindexdb -d nethealth -u netview Creating the Table Structures and Indices . . . Creating the Table Structures and Indices for sample tables . . . Granting the Privileges . . . Granting the Privileges on the sample tables . . . Index of database 'nethealth' for user 'netview' completed successfully. 5- I then attempted to re-run rollups for the date which exhibited the 1st failure in my script (May 19) - nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -now 5/19/2001 - This failed with the same error as above (duplicate keys). 
6- I had the customer re-run the May 19 rollups in debug mode - nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -now 5/19/2001 -Dall - Saved onto Bafs for ticket 56040 I do not see how the rollup can be failing on an index when all stats tables do have an index and the nhiIndexDb was successful. All relevant files have been placed on Bafs for call ticket # 56040 11/8/2001 4:55:33 PM yzhang After you have taken care of the space problem, have the customer do the following: echo "create table nh_stats1_989639999 as select * from nh_stats1_989553599 where 1=2\g" | sql nethealth Then run the stats rollup (without debug), and send me the rollup log file. Thanks Yulun 11/8/2001 6:29:06 PM yzhang Please obtain the database; this is a problem where a certain type of element causes the duplicate during stats rollup, and we need to find out what element causes the problem with their database. Thanks Yulun 11/12/2001 2:52:53 PM yzhang Yes, you can write it in this way; here are two comments: 1) use a regular drop, and add a commit statement after each drop. 2) add updating nh_rlp_boundary in the loop, and make another while loop for updating the nh_rlp_boundary table. After you finish scripting, create the same situation on your system for testing your script. You can send me the script again if you want. Thanks Yulun 11/14/2001 3:05:36 PM jnormandin - Customer suggested we drop stats0 tables from May - Sept. Did this (dropped 2500+ tables.) - At this point, re-ran rollups and all is well. - Problem table was dropped during the above procedure 11/12/2001 12:58:02 PM tfuller From: Zhang, Yulun Sent: Monday, November 12, 2001 11:23 AM To: Chapman, Sheldon; Fuller, Thomas Subject: chunk's db problem Their nhDbStatus hangs no matter whether it is run from the console or the command line. I had them stop nhServer and Ingres, then restart everything. nhDbStatus still hangs, and they noticed that iidbms takes 98-100% cpu. Collecting the following information: 1) verifydb -mreport -sdbname "" -odbms_catalogs. 
(collect iivdb.log) 2) advanced debug for nhiDbStatus 3) errlog.log 4) results of a sysmod nethealth 11/12/2001 5:19:45 PM tfuller Yulun, I put the output of the commands you requested on BAFS. Spoke to the proserve person on site, JimR. He stopped the running nhDbStatus command and CPU usage went to 6%. Checked the log files, and feels that the only problem right now is that nhDbStatus hangs and consumes 104% of CPU until ended. He is planning to upgrade the server to eHealth 5.0.1 tomorrow morning 11/13/2001 11:40:01 AM yzhang Yes, we need to look at the problem, but if they upgrade, then there is no environment for us to study this problem. I think they can do the upgrade now; then we need to keep a very close watch on the behavior of iidbms and the system catalog. Thanks Yulun 11/13/2001 2:06:02 PM yzhang The customer is upgrading to 5.0.1 11/12/2001 3:45:25 PM beta program PWC: Don Mount email =donald.d.mount@us.pwcglobal.com jobtitle =Manager company =Pricewaterhouse Coopers address =3109 W. MLK Jr. Blvd phone =813-348-7252 Description: Database Status window does not reflect the database disk freespace. Database Status window needs complete rework for the Oracle database. 11/13/2001 11:02:15 AM wzingher Robin, this was fixed with your checkin yesterday, I believe. Can you confirm? Thanks. 11/13/2001 11:02:56 AM wzingher Oops, that comment was for another ticket. This is for you to check, Saeed. 11/13/2001 6:41:25 PM shonaryar I already made changes to CdbSystem which reflect this ticket 12/27/2001 11:26:15 AM beta program This bug is NOT fixed. See below email from beta site: Database Status window shows incorrect data for ORACLE database. Free Space. Conversation status shows Database size 0.0 for as-polled conversations, rolled up conversations and rolled up top conversations. Reported in beta 1 - bug 19149 1/7/2002 11:31:20 AM wzingher Reassigning to B4, please test and validate. 
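The "create table ... where 1=2" workaround suggested in the 11/8/2001 4:55 PM rollup update above relies on a classic SQL idiom: the predicate is never true, so the new table gets the source's column layout but zero rows. A sqlite3 demonstration of the idiom (column names invented; the real tables are Ingres, queried via sql nethealth):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table nh_stats1_989553599 (element_id int, sample int)")
con.execute("insert into nh_stats1_989553599 values (1, 42)")

# 1=2 is never true, so only the column definitions are copied, no rows:
con.execute("create table nh_stats1_989639999 as "
            "select * from nh_stats1_989553599 where 1=2")

assert con.execute(
    "select count(*) from nh_stats1_989639999").fetchone()[0] == 0
cols = [row[1] for row in con.execute("pragma table_info(nh_stats1_989639999)")]
assert cols == ["element_id", "sample"]
```

This gives the rollup an empty stats table to work against without hand-writing the column list.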
1/14/2002 5:39:43 PM rhawkes This is probably addressed by work that Gary Pratt has completed -- reassigning to Gary for verification. 1/15/2002 11:59:15 AM gmp After reading this ticket, it is not something that would be fixed by my recent 'analyze table' changes. This appears to be a bug where the code that calculates 'Traffic Accountant' table space is not working (see the 12/27/01 comment). Reassigning this to rhawkes 1/15/2002 1:38:58 PM rhawkes After discussions with Gary and Saeed, we've determined that this is fixed by the recent analyze indexing tables work. This should be retested in a kit which also includes the fix to 20352, which should be available on 1/16. 2/26/2002 10:19:36 AM Betaprogram Don Mount said that this fix IS NOT VERIFIED on his system. Reopening the ticket 2/27/2002 1:05:17 PM rhawkes Saeed has a new fix. Ravi has seen this problem on his system, so Saeed will test it there. 2/28/2002 5:14:29 PM shonaryar Changed routines getDbFreeSpace and getDbSize saeed 3/12/2002 11:33:30 AM Betaprogram Don Mount has upgraded to Beta 5 and he has reported that this still does not reflect proper free disk space. Reopening ticket. 3/13/2002 10:08:33 AM shonaryar Free Space in the DbStatus window does not represent free space on disk; it represents free space in the tablespace. The tablespace is shown in the Location name column. Since a tablespace can include different datafiles and each datafile can be on a different device, in 5.6 we should probably change the UI to display a number of locations per location name. This way we can list free space on each device. saeed 11/13/2001 1:38:02 PM bmiller Customer: TECO Contact: John Witte Phone: 770-844-5911 Email: jwitte@concord.com On Live Exceptions browser, there are no technologies or groups listed under the 'All Technologies' field on the left hand side of the browser. - Customer had NH 4.8 on an NT machine. - John saved DB on the NT machine - Did a fresh install of 5.0 RTM on a Win2k machine. 
- Loaded the db from the NT machine. eHealth runs fine, except when he starts Live Exceptions, he gets a pop up that says: "the number of monitored polled elements has exceeded the licensed limit" He says OK to the error, and LE appears, but under 'all technologies' in the window on the left, there is no pull down. He can't access the other technologies at all. He saved the DB from the 4.8 box using the -ascii flag and tried again - same results He loaded the DB onto a different 4.8 NT box, and then upgraded to 5.0.1 - same results He sent in a copy of the db save, and while we did not receive the license error that John did, we were able to reproduce the problem with the listing of technologies and groups not appearing. A copy of the db is saved to BAFS (BAFS/escalated tickets/56000/56029/TECO_dbsave.zip) along with load logs and screen shots of the problem. 11/13/2001 2:46:02 PM tlarosa The licensing scheme for LE has changed, so a new license needs to be generated for the number of polled elements. My first thought on nothing showing under "All Tech" is that there is an OM conversion problem; this does not have anything to do with the browser but is an upgrade issue. Let me check into it. Tom 11/13/2001 6:30:55 PM tlarosa Hi Guys, The problem with this upgrade is a database inconsistency in table: nh_group_list_members. This table refers to a group that does not exist in the group table, which means we have an upgrade problem. To get unblocked the customer could delete the row in question themselves by using sql. The problem row has a group_list_id = 1000010 and a group_id = 1000010. To help our database team fix the upgrade issue we would like the customer to send us their 4.8 database. This is not an LEB issue. Thanks, Tom 11/14/2001 9:43:25 AM tlarosa Hi Robin, Steve asked me to move this ticket to you. 
Tom 11/15/2001 4:51:53 PM bmiller From: Normandin, Jason Sent: Thursday, November 15, 2001 11:19 AM To: Miller, William Subject: LE group issue Importance: High Bill. Have the customer run the following sql statement: select group_list_id from nh_group_list_members where group_id not in (select group_id from nh_group) * The basic problem is that there are group_id associated with group_list_ids that do not exist in the nh_group table. This is therefore causing the relationship between the tables to fail. -Jason 11/15/2001 4:52:14 PM bmiller From: Miller, William Sent: Thursday, November 15, 2001 3:39 PM To: 'dchall@tecoenergy.com' Cc: Witte, John; Cole, Randall Subject: Ticket #56029: LE group issue Hi DeAnna, This is a follow up on ticket number 56029 concerning the LiveExceptions database problem that we have come across. The basic problem is that there are group_ids in the database associated with group_list_ids that do not exist in the nh_group table. This is causing the relationship between the tables to fail. In order to resolve this problem, please do the following on the 5.0 machine that is currently experiencing the problem: 1) If any LiveExceptions browsers are open, please close them. 2) Open up an sql session with the database from the command line with the command: sql e.g: sql ehealth You should get a new command prompt that looks like the following: continue * 3) Issue the following sql command to get a listing of the problematic rows that we are searching for: select group_list_id from nh_group_list_members where group_id not in (select group_id from nh_group)\g It is likely that you will see seven rows in the output. Please let me know right away if you do not see any output from this command. 
4) We must now delete the problematic rows from the database using the following command: delete from nh_group_list_members where group_list_id in (select group_list_id from nh_group_list_members where group_id not in (select group_id from nh_group))\g 5) Issue the command 'commit\g' from the command prompt. 6) Run the command from step 3 again to see if any of the problematic rows remain. Please let me know right away if any of the rows do remain. 7) Log out of the sql session by issuing the command '\q' from the command prompt. After this procedure is finished, please launch LiveExceptions. You should be able to access your technology types and groups, and should be able to associate rules profiles with groups. Please let me know if you have any questions about these steps, and also please let me know if this helps to resolve the problem. Thank you, Bill 11/16/2001 11:11:53 AM mfintonis Updating status to MoreInfo. Need to find out if the customer is up and running on the workaround; if so, can we de-escalate? 11/20/2001 9:49:12 AM rtrei When I looked at the query, I realized that the intended fix for 17990 had actually caused a far more severe bug. Discussed with managers. We have decided to cut new CDs and do a 5.0.2 release. 11/27/2001 9:45:26 AM rtrei Code has been checked in. 
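The orphan-row cleanup in the steps above is a standard "not in" anti-join: list the nh_group_list_members rows whose group_id has no match in nh_group, then delete them. A sqlite3 sketch of the same pattern with invented sample rows (the real database is Ingres, queried via sql):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table nh_group (group_id int)")
con.execute("create table nh_group_list_members "
            "(group_list_id int, group_id int)")
con.execute("insert into nh_group values (1)")
con.executemany("insert into nh_group_list_members values (?, ?)",
                [(100, 1), (1000010, 1000010)])  # second row is the orphan

# List rows whose group_id has no match in nh_group:
orphans = con.execute(
    "select group_list_id from nh_group_list_members "
    "where group_id not in (select group_id from nh_group)").fetchall()
assert orphans == [(1000010,)]

# Delete those rows, commit, and verify only the valid row remains:
con.execute(
    "delete from nh_group_list_members where group_list_id in "
    "(select group_list_id from nh_group_list_members "
    "where group_id not in (select group_id from nh_group))")
con.commit()
assert con.execute(
    "select count(*) from nh_group_list_members").fetchone()[0] == 1
```

Re-running the select after the delete, as the instructions direct, is the verification step: an empty result means the table relationship is consistent again.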
11/15/2001 2:51:27 PM wburke The following is an excerpt from the customer's nh_rlp_boundary:
|ST | 0| 1005703200| 1005706799| 1005703231| 1005703241| | 0|
|ST | 0| 1005768000| 1005771599| 1005771599| 1005771599|
From table.out:
nh_stats0_1005706799 nhuser table
nh_stats0_1005706799_ix1 nhuser index
nh_stats0_1005706799_ix2 nhuser index
nh_stats0_1005710399 nhuser table
nh_stats0_1005710399_ix1 nhuser index
nh_stats0_1005710399_ix2 nhuser index
nh_stats0_1005713999 nhuser table
nh_stats0_1005713999_ix1 nhuser index
nh_stats0_1005713999_ix2 nhuser index
nh_stats0_1005717599 nhuser table
nh_stats0_1005717599_ix1 nhuser index
nh_stats0_1005717599_ix2 nhuser index
nh_stats0_1005721199 nhuser table
nh_stats0_1005721199_ix1 nhuser index
nh_stats0_1005721199_ix2 nhuser index
nh_stats0_1005724799 nhuser table
nh_stats0_1005724799_ix1 nhuser index
nh_stats0_1005724799_ix2 nhuser index
nh_stats0_1005728399 nhuser table
nh_stats0_1005728399_ix1 nhuser index
nh_stats0_1005728399_ix2 nhuser index
nh_stats0_1005731999 nhuser table
nh_stats0_1005731999_ix1 nhuser index
nh_stats0_1005731999_ix2 nhuser index
nh_stats0_1005735599 nhuser table
nh_stats0_1005735599_ix1 nhuser index
nh_stats0_1005739199 nhuser table
nh_stats0_1005739199_ix1 nhuser index
nh_stats0_1005739199_ix2 nhuser index
nh_stats0_1005742799 nhuser table
nh_stats0_1005742799_ix1 nhuser index
nh_stats0_1005742799_ix2 nhuser index
nh_stats0_1005746399 nhuser table
nh_stats0_1005746399_ix1 nhuser index
nh_stats0_1005746399_ix2 nhuser index
nh_stats0_1005749999 nhuser table
nh_stats0_1005749999_ix1 nhuser index
nh_stats0_1005749999_ix2 nhuser index
nh_stats0_1005753599 nhuser table
nh_stats0_1005753599_ix1 nhuser index
nh_stats0_1005753599_ix2 nhuser index
nh_stats0_1005757199 nhuser table
nh_stats0_1005757199_ix1 nhuser index
nh_stats0_1005757199_ix2 nhuser index
nh_stats0_1005760799 nhuser table
nh_stats0_1005760799_ix1 nhuser index
nh_stats0_1005760799_ix2 nhuser index
nh_stats0_1005764399 nhuser table
nh_stats0_1005764399_ix1 nhuser index
nh_stats0_1005764399_ix2 nhuser index
nh_stats0_1005767999 nhuser table
nh_stats0_1005767999_ix1 nhuser index
nh_stats0_1005767999_ix2
We need to reconstruct the nh_rlp_boundary table to include these. 11/15/2001 4:14:20 PM wburke tables.out and rlp_boundary on BAFS/56476/11.15.01 11/16/2001 11:18:11 AM wburke
declare global temporary table session.missed as
  select table_name from iitables i
  where table_name like 'nh_stats%99%'
    and table_name not like '%ix%'
    and not exists (select table_name from nhv_stats_tables s
                    where i.table_name = s.table_name)
  on commit preserve rows with norecovery;
declare global temporary table session.raw as
  select max_range = int4(right(table_name, size(squeeze(table_name)) - 10))
  from session.missed where table_name like 'nh_stats0%'
  on commit preserve rows with norecovery;
declare global temporary table session.hour as
  select max_range = int4(right(table_name, size(squeeze(table_name)) - 10))
  from session.missed where table_name like 'nh_stats1%'
  on commit preserve rows with norecovery;
declare global temporary table session.day as
  select max_range = int4(right(table_name, size(squeeze(table_name)) - 10))
  from session.missed where table_name like 'nh_stats2%'
  on commit preserve rows with norecovery;
insert into nh_rlp_boundary
  select 'ST', 0, max_range - 3600, max_range, max_range - 3600, max_range, '', 0
  from session.raw\g
insert into nh_rlp_boundary
  select 'ST', 1, max_range - 86400, max_range, max_range - 86400, max_range, '', 0
  from session.hour\g
insert into nh_rlp_boundary
  select 'ST', 2, max_range - (86400*7), max_range, max_range - (86400*7), max_range, '', 0
  from session.day\g
11/16/2001 2:05:30 PM yzhang Get the script from ~yzhang/scripts/reconstruct_rlp_boundary.sh. I tested it with an empty database, which means the syntax is fine; just type the script name to run it after logging in as nhuser and sourcing nethealthrc. 
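The SQL above derives each missing nh_rlp_boundary row from the epoch suffix of a stats table name, using the span of that rollup level (1 hour for stats0, 1 day for stats1, 1 week for stats2). A hypothetical Python re-implementation of the same arithmetic (the boundary_row helper and its tuple layout are invented here to mirror the insert statements):

```python
SPANS = {"0": 3600, "1": 86400, "2": 86400 * 7}  # raw, hourly, daily levels

def boundary_row(table_name):
    """Build the nh_rlp_boundary row for one missing nh_stats table."""
    prefix, suffix = table_name.rsplit("_", 1)  # "nh_stats0", "1005706799"
    level = prefix[len("nh_stats"):]            # "0", "1" or "2"
    max_range = int(suffix)
    span = SPANS[level]
    return ("ST", int(level), max_range - span, max_range,
            max_range - span, max_range, "", 0)

# One of the stats0 tables from the table.out listing above:
row = boundary_row("nh_stats0_1005706799")
assert row == ("ST", 0, 1005703199, 1005706799,
               1005703199, 1005706799, "", 0)
```

This makes the reconstruction rule easy to check by eye: each boundary row's start is exactly one rollup span before the table's epoch suffix.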
You can recreate the customer's situation, i.e. remove some records from the nh_rlp_boundary table, then run the script to see whether it actually inserts the data you want. Thanks Yulun
11/16/2001 3:30:39 PM wburke -----Original Message----- From: Burke, Walter Sent: Friday, November 16, 2001 3:21 PM To: 'Cynthia.badgett@hqasc.army.mil' Subject: Ticket # 56476 - dataRestore
Cynthia, Please run a manual save of the database and call it "support". As $NH_USER:
source nethealthrc.csh
nhSaveDb -p $NH_HOME/db/save/support.tdb nethealth
Next, place this script in $NH_HOME:
chmod 777 rlp.sh
dos2unix rlp.sh
nhServer stop
echo "select * from nh_rlp_boundary\g" | sql nethealth > rlpA.out
Run the script (rlp.sh), then:
echo "select * from nh_rlp_boundary\g" | sql nethealth > rlpB.out
echo "help\g" | sql nethealth > tables.out << File: rlp.sh >>
nhServer start
Run reports and send all output files. Any questions, please call. Sincerely, Walter
12/3/2001 1:48:03 PM cestep Once the rollup boundary table was changed to reference the tables, it was fixed. This can be closed.
12/4/2001 3:02:14 PM yzhang Colin, regarding your last update: did the script fix the nh_rlp_boundary table? "Once the rollup boundary table was changed to reference the tables, it was fixed. This can be closed."
12/4/2001 4:25:11 PM yzhang The last thing we want to find out is why the references in nh_rlp_boundary were missing. Did they update the nh_rlp_boundary table? Were the stats0 tables that are not referenced from nh_rlp_boundary loaded from another server? Consider all the possibilities you can think of, and check with the customer on this. Thanks Yulun
12/11/2001 4:32:26 PM yzhang Colin, per my last update: if they don't provide the information by tomorrow, we will close the ticket. The last thing we want to find out is why the references in nh_rlp_boundary were missing. Did they update the nh_rlp_boundary table? Were the stats0 tables that are not referenced from nh_rlp_boundary loaded from another server? Consider all the possibilities you can think of, and
check with the customer on this.
12/12/2001 10:54:21 AM yzhang Support asked to close this one.
11/15/2001 3:42:22 PM dwaterson Fetch failing with duplicate element_ids. Descr: I did another manual fetch without any other remotes, and this time it complained that the NH_SERVER_ID needs to be changed and that we need to destroy the database and start again. Error: dipnhpo1::/opt/Nethealth/db/remotePoller/Remote.tdb.11-14-2001_12.10.12 ERROR: Both dipnhpo1 and susno049 have elements in the same range. Note: Walter Burke has been working on this issue with Yulun. All relevant information is on BAFS: /56000/56434
11/15/2001 6:45:50 PM yzhang On the remote site: 1) nhRemoteSaveDb -h will give you all the possible options you can use; the simple command is nhRemoteSaveDb -g -gl db_name, where db_name should equal the value shown by env | grep RDB on the remote site. I saw that your remote db save failed, from the mail you sent this morning. After the save, go to remotePoller/remote.tdb/*tar to verify the group is inside the tar file; also check for an errlog in remotePoller/remote.tdb. You can do this manually for one remote. Be sure to remove file.copied from the remote and file.processed from the central before fetching; the safe way is to remove the whole remotePoller directory on the central site before fetching. Second question: the fetch and the maintenance job have no direct relationship. What the maintenance job does is stop and restart nhServer; if you set the environment variable NH_RESET_INGRES to yes, it also stops and restarts Ingres. Walter, find out if he set this variable, and get the latest maintenance log; note also that nhFetchDb requires Ingres to be running. I don't quite understand your third question; Walter, see if you can help. Thanks Yulun
11/16/2001 12:11:04 PM wburke Vinesh destroyed and recreated the DB. The error most likely occurred due to an incorrect SERVER_ID set in one of the rc files. Re-fetching with the new database worked correctly.
11/16/2001 1:16:03 PM yzhang Problem solved.
11/15/2001 3:45:52 PM dwaterson Issue: RemoteSaves with the -g and -gl options fail; still cannot load group and group-list. Customer stated: We implemented the nhFetchDb and nhiCheckDupStats produced by Yulun. It worked with existing remotes but does not work with the new remote. Plus we are having problems with groups and group-lists. Note: Walter and Yulun have been working this issue. Relevant information is on BAFS: /56000/56432
11/16/2001 10:44:18 AM yzhang Thank you for the appreciation of Concord support; we will keep working with you continuously on whatever problems you have. Regarding the group and group list: if you open remotestats.tar with tar tvf remotestats.tar, you will see groups.zip and glists.zip, including their file sizes; this means you have saved the group and group list successfully. Note that tar tvf remotestats.tar just shows you what is in the tar file. If you want to completely untar it you need to use tar xvf remoteStats.tar, but don't use tar xvf on the remote site. Thanks Yulun
11/16/2001 11:26:11 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Friday, November 16, 2001 1:36 AM To: 'Zhang, Yulun'; Burke, Walter Subject: RE: 56434 - Fetch failing with duplicate element_ids
Yulun, I have changed the central site nethealthrc.sh.usr file. I will advise you on Monday whether this has fixed the problem. To check whether the groups & group-list are included in the remoteSaveDb: when I untar the file remotestats.tar, which file should I then unzip to see if the groups are there? I tried looking into one unzipped file from the untar of remotestats.tar and I had problems opening the file. Vinesh Latchman Project Manager, Hosting & Internet, IS, Telstra Ph: (03) 9634 6294
11/19/2001 2:54:51 PM yzhang Vinesh, nhRemoteSaveDb -g -gl nethealth will save the groups and group lists into *.zip files only if you modified the group and group list after the last remote save;
otherwise, you will not see groups.zip and glists.zip. I think your problem is that the previous fetch against a successful save with -g and -gl failed, and you did not modify the group and group list on the remote site afterward, so when you did the save again there was no group and group list saved, and none was merged into the central when fetching. You can manually add the group and group list into the central; you already did this. Walter, if Vinesh still has the problem, I can write a script for him to add the group and group-list zip files into the remotestats.tar file on the remote. Thanks Yulun
11/20/2001 9:53:30 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Monday, November 19, 2001 7:17 PM To: 'Zhang, Yulun'; Burke, Walter Subject: RE: nhFetchDb group & grouplist problem
All, My tests confirm what was said in the email below. Nethealth only sends updates. So if there is a group called "test" at the remote but not at the central, then Nethealth will not add the group called "test" at the central until "test" is updated with new elements at the remote. What happens when elements are deleted from a group at the remote? Will the group at the central get updated? What happens if the group is deleted at the remote? Will the group be deleted at the central? Vinesh Latchman Project Manager, Hosting & Internet, IS, Telstra Ph: (03) 9634 6294
11/20/2001 10:31:48 AM wburke (Same message from Vinesh forwarded again; answers appear inline below.)
Q: When elements are deleted from a group at the remote, will the group at the central get updated? A: Yes. Q: If the group is deleted at the remote, will the group be deleted at the central? A: Yes. Vinesh Latchman
11/20/2001 10:35:33 AM yzhang Problem solved.
11/19/2001 2:07:33 PM foconnor Excerpt from the errlog.log file:
E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (hpb.us5/00) Server -- Normal Startup.
SLSANH::[49356, 40a5a900]: Wed Oct 31 09:12:48 2001
E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database nethealth, Table nh_stats0_1003953599, Page 332.
Stack dmp name 49356 pid 2348 session 0: 7F448B10: IICSMTp_semaphore(00000078,00000050,00000040,00000001)
Stack dmp name 49356 pid 2348 session 0: 7F448B10: IICSp_semaphore(00000078,00000050,00000040,00000001)
Stack dmp name 49356 pid 2348 session 0: 7F448A90: dm0s_mlock(00000000,0000003C,00000008,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F448990: dm2f_sense_file(403AC774,0000001D,00000008,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F448850: gforce_page(403AD068,403AD008,00000008,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F448690: gfault_page(40366380,40359DE8,7F447728,00000008)
Stack dmp name 49356 pid 2348 session 0: 7F4483D0: dm0p_cachefix_page(00000000,403B19B0,4038E910,4036630C)
Stack dmp name 49356 pid 2348 session 0: 7F448110: dm0p_fix_page(00000000,406554E0,4064F0A0,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F447E50: dm1b_get(422A3150,00000000,00000000,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F447D90: dm2r_get(00000001,00000001,FFFFFFFD,40664880)
Stack dmp name 49356 pid 2348 session 0: 7F447850: dmr_get(00000000,00000007,00000000,7F447674)
Stack dmp name 49356 pid 2348 session 0: 7F447790: dmf_call(00000000,00000000,00000002,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F4476D0: qen_orig(7F4475D8,00000004,00000000,7F44709C)
Stack dmp name 49356 pid 2348 session 0: 7F447590: qen_orig(7F446E78,00000000,00000002,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F447490: qeq_fetch(41870200,40E63F08,40009DD8,40009DD0)
Stack dmp name 49356 pid 2348 session 0: 7F4473D0: qef_call(407CBE60,41969480,00000800,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F447350: scs_sequencer(7F446C38,7F446BBC,00012000,00000001)
Stack dmp name 49356 pid 2348 session 0: 7F446B10: scs_sequencer(41964FA0,00000000,00000000,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F445350: CSMT_setup(00000003,41964FA0,419650E4,00000000)
Stack dmp name 49356 pid 2348 session 0: 7F445090: C1179D58(00000000,00000000,00000000,00000000)
SLSANH::[49356, 41964fa0]: Bus Error (SIGBUS) IICSMTp_semaphore(0x4d5ad4) @ PC = 4d90e8, SP = 7f448b10, PSW = 4001f
OPERATING SYSTEM: HP-UX 11.0 64bit
HARDWARE MODEL: 9000/800/N4000-36
MEMORY: 1 GB
VERSION: 4.8
INSTALLED PATCHES: Patch: P04/D04
11/19/2001 2:14:03 PM foconnor Spoke to Yulun. Requested files from the customer.
11/19/2001 3:03:18 PM yzhang Requested nhCollectCustData.tar.
11/20/2001 6:05:39 AM foconnor Information is on //BAFS/escalated tickets/56000/56089. There is nothing spectacular in the files.
11/20/2001 6:05:59 AM foconnor -
11/27/2001 5:48:47 PM yzhang Have the customer change ii.slsanh.dbms.*.stack_size from 131072 to 262144 through cbf, then recycle Ingres (ingstop, then ingstart). See if this takes care of the stack dump.
12/3/2001 10:45:10 AM yzhang Farrell, I recommended the following last time; did it work? Have the customer change ii.slsanh.dbms.*.stack_size from 131072 to 262144 through cbf, then recycle Ingres (ingstop, then ingstart). See if this takes care of the stack dump. Yulun
12/17/2001 3:35:38 PM mwickham Customer's problem is solved; please close this problem ticket.
12/19/2001 1:38:57 PM apier Customer increased the stack size and the problem was resolved. In Yulun's absence I am closing the problem ticket.
11/19/2001 3:11:20 PM beta program PWC: Don Mount donald.d.mount@us.pwcglobal.com 813-348-7252. Scheduled maintenance failure. Scheduled for 9 AM Sunday.
Default schedule.
11/18/2001 09:00:37 CuOcFailedExec Pgm nhiDbServer: Error: Exec of `/opt/concord/neth/bin/nhReset` failed.
uxpwcapp4% more Maintenance.100004.log
----- Job started by User at `11/18/2001 09:00:37 AM`. -----
----- $NH_HOME/bin/nhReset -----
11/19/2001 6:31:22 PM rlindberg This is fixed in B2.
2/26/2002 10:40:20 AM Betaprogram Customer verified this is fixed in Beta 4.
11/19/2001 4:28:15 PM dwaterson Issue: Conversation rollups failing silently. Customer sees: 11/13/2001 12:06:10 DbsOcDbJobStepFailed Error: Job step 'Conversations Rollup' failed (the error output was written to /concord/neth/log/Conversations_Rollup.100001.log Job id: 100001). No errors are written to the conversation rollup log. Conversations_Rollup.100001.log:
----- Job started by Scheduler at '11/13/2001 12:05:37 PM'. -----
----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME -----
----- Scheduled Job ended at '11/13/2001 12:06:11 PM'. -----
The Ingres error log has not been written to since November 1, 2001. See BAFS for the error log and tables.out.
12/3/2001 3:01:16 PM cestep Any updates on this one?
12/4/2001 10:55:12 AM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, December 04, 2001 10:45 AM To: Trei, Robin Cc: Chapman, Sheldon Subject: 19307 - DialogRollups fail silently
Hi Robin, This is a perplexing problem, as there are no errors written to the Ingres errlog or the job log. 11/13/2001 12:06:10 DbsOcDbJobStepFailed Error: Job step 'Conversations Rollup' failed (the error output was written to /concord/neth/log/Conversations_Rollup.100001.log Job id: 100001). No errors written to the conversation rollup log. Any ideas?
Thanks,
12/4/2001 11:04:17 AM wburke -----Original Message----- From: Trei, Robin Sent: Tuesday, December 04, 2001 10:54 AM To: Burke, Walter; Karten, Eric; Zhang, Yulun Cc: Chapman, Sheldon Subject: RE: 19307 - DialogRollups fail silently
I'm betting it may have been duplicates. I think Eric Karten (who owns the conversation/TA area) and Yulun have been working with a customer in this area. I don't know if it is an exact match or not, but I'm cc'ing them for their opinions.
12/4/2001 11:17:22 AM yzhang This problem is from the customer called PriceWaterhouse. It is the same problem that happened in the past, and it is now happening again. We have tried all the approaches in the past, including cleannode, setting nh_unref_node_limit, and so on. They also used to poll an internet probe that our product does not support. This ticket is currently assigned to Robin; I suggest assigning it to Eric. Yulun
1/7/2002 8:51:48 PM rtrei Eric, I'm not sure whether you have looked into this or not; I was not aware that it had been assigned to me. It could be a duplicate problem as they suggest, but I suspect it might be the case that they are blowing the stack. If this happens they need to ulimit their stack; that usually works on Solaris. I am assigning it to you because this area really needs to be rewritten at some point, and I don't know if it is on your 5.6 or 6.0 list.
1/8/2002 10:05:41 AM wburke -----Original Message----- From: Trei, Robin Sent: Monday, January 07, 2002 8:44 PM To: Burke, Walter Cc: Karten, Eric Subject: 19307/56384
Walter, Ticket 19307 deals with conversations rollups silently failing. I was not aware that this ticket was assigned to me, so sorry for the long silence. I am actually assigning it to Eric for the initial investigation, although it could well come back to me. However, the first thing I think you should do is to check that the conversations rollup is not blowing the stack. (You can check by looking for a core file for the conversations rollup.)
If this is what it is, the workaround is straightforward and is documented in the db worksheets. Basically, you just have to ulimit the stack.
1/8/2002 11:03:08 AM ekarten Please follow Robin's instructions. The bug is now in MoreInfo.
1/9/2002 11:13:50 AM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, January 09, 2002 11:02 AM To: 'donald.d.mount@us.pwcglobal.com' Subject: Ticket # 56834 - Conversations rollups fail silently
Hi Don, I have been re-assigned this issue. 1) If the rollups are still failing, please search the eHealth drive for any core files. We believe that a blown memory stack may be causing the failure. 2) If a core file is found, please run file on it; this will tell us which executable caused the dump. Sincerely,
1/16/2002 8:59:54 AM wburke -----Original Message----- From: donald.d.mount@us.pwcglobal.com [mailto:donald.d.mount@us.pwcglobal.com] Sent: Wednesday, January 16, 2002 8:49 AM To: WBurke@concord.com Subject: Re: Ticket # 56834 - Conversations rollups fail silently
Walter, I am still running OK since I created a new database the first of the month. I will send the files as soon as the rollups start failing.
1/16/2002 9:18:10 AM dbrooks Close per critical bug meeting.
1/17/2002 12:24:21 PM mfintonis PWC 5.5 Bug 19971: We are looking for a short-term solution (such as tuning Oracle) to the problem reported by PWC in problem ticket 19971. Please note that PWC has another critical problem ticket, 19307, against 4.7 that shows similar symptoms. Reopening this ticket for now because of the note above.
1/22/2002 3:46:55 PM ekarten Although this may be related to 19971, I am closing this since they have not seen this problem in the production environment for quite some time. I also understand that they are at 4.8 in the production environment.
11/20/2001 10:36:00 AM mmcnally Scheduled dataAnalysis log filling up with errors. Job step Analysis is failing.
10/18/2001 01:52:29 DbsOcDbJobStepFailed Error: Job step 'Data Analysis' failed (the error output was written to /opt/nethealth/log/Data_Analysis.100002.log Job id: 100002). Customer recreated the Health report that is attached to this job and still receives errors. Also rescheduled the job with no changes. Associated files can be located on BAFS/55400/55404.
11/20/2001 10:58:33 AM yzhang Mike, Please do the following: 1) find out if the groups and group lists under $NH_HOME/reports match the groups and group lists on the console; if they do not match, remove from $NH_HOME/reports whatever groups and group lists do not appear on the console; 2) check that they have the correct time zone; 3) run dataAnalysis in the advanced debug and send the output file. Thanks Yulun
1/22/2002 12:08:57 PM yzhang Requested to check the time zone and do the advanced debug.
1/31/2002 9:24:40 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Tuesday, January 29, 2002 9:25 AM To: Zhang, Yulun Subject: PT 19321
Yulun, The customer says no advanced logs were created. He is looking in the correct location: $NH_HOME/log/advanced. Let me know what we should do next. Thanks, Mike
2/26/2002 12:27:13 PM yzhang Can you check whether the customer still has the same problem? If so, practice on your machine to find out how to get the advanced debug log from a scheduled DA; see the debug.cfg file under $NH_HOME/sys as a guide, or talk to somebody in support to find out how. Thanks Yulun
3/12/2002 11:46:04 AM mmcnally Requested the customer's database to load on my test machine.
4/2/2002 5:00:19 PM yzhang Mike, You mentioned that you loaded the customer's db on your machine; were you able to reproduce the customer's problem?
4/10/2002 8:05:17 AM foconnor Customer upgraded and the problem went away.
11/20/2001 12:19:39 PM cestep Some of the customer's stats0 tables are missing indices.
It's about 3 days' worth of data that is not indexed: from Fri Nov 09 12:59:59 2001 to Mon Nov 12 02:59:59 2001. During indexing we get the following error: Begin processing (20/11/2001 16:59:05). Error: Sql Error occurred during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Nov 20 11:02:47 2001)). Sent them the nhCleanDupStats script from Yulun. After running the script, we checked the tables, and they are still not indexed. Files are on BAFS, under ticket #56428/11.20.01.
11/20/2001 12:38:09 PM yzhang What is the output from nhCleanDupStats?
11/20/2001 2:45:17 PM cestep There is no output.
11/20/2001 3:11:37 PM yzhang Run ~yzhang/scripts/cleanStats_mod.sh clean. Make sure to run it exactly like this: cleanStats_mod.sh clean
11/26/2001 8:56:52 AM cestep This issue was resolved.
11/26/2001 8:57:03 AM cestep Changing to assigned.
11/27/2001 5:50:23 PM yzhang This is the stats rollup, right? You can do the following: 1) check to make sure they have enough disk space; 2) if they have enough disk space, you need to obtain the database. This is a problem in which the duplicates came from the rollup, so you cannot see it from the table list. Get the database and load it onto the same platform and Nethealth version; let me know when you are at this point. Thanks Yulun
12/3/2001 11:01:51 AM cestep The cleanStats_mod clean worked. The tables were indexed, and the problem is resolved. This ticket can be closed.
12/3/2001 11:17:59 AM yzhang Problem solved.
12/4/2001 3:09:47 PM cestep Re-opening this ticket. The behavior came back.
1/15/2002 3:32:41 PM yzhang The customer was running into a disk problem, which has since been resolved. This ticket will be closed.
11/20/2001 1:55:55 PM jnormandin Customer installed 5.0.1 clean after running 5.0 beta 4 (saved the DB from 5.0 b4, uninstalled, then installed 5.0.1 clean). Customer then loaded the db into the new 5.0.1 system, and the server would stop with the "Server stopped unexpectedly, Restarting" message.
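The E_US1592 failure above means the stats rows contain duplicate key values, so a unique index cannot be built until the duplicates are gone. Whatever the real nhCleanDupStats/cleanStats_mod.sh scripts do internally, the core idea is to keep one row per key and drop the rest; a minimal Python sketch of that idea (drop_duplicate_keys and the key columns are illustrative, not the actual script):

```python
def drop_duplicate_keys(rows, key_cols):
    """Keep the first row seen for each key tuple and drop later duplicates,
    so that a unique index over key_cols could then be created."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Hypothetical stats rows; the middle one duplicates the first row's key.
rows = [
    {"element_id": 7, "sample_time": 1005706799, "val": 1.0},
    {"element_id": 7, "sample_time": 1005706799, "val": 1.0},
    {"element_id": 8, "sample_time": 1005706799, "val": 2.5},
]
clean = drop_duplicate_keys(rows, ("element_id", "sample_time"))
```

As the ticket shows, when the duplicates keep coming back (here, from the rollup), deduplication only treats the symptom; the root cause still has to be chased.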
This would then fail with a 'Console initialization failed' error. An older version of the db (from 2 months ago) was loaded; no problem was noted with that db, and the server and console started without incident. Given this information, I had the customer send me the DB from 5.0 b4 to analyze in-house. I loaded this db into a 5.0.1 system and received the same results. After debugging the situation, it appears that nhiLiveExSvr is dying and thus causing nhiServer to shut down. This was verified by commenting out the startup entry for nhiLiveExSvr in the $NH_HOME/sys/startup.cfg file; once this was done, the server started without incident. Information received from debugging nhiServer, nhiLiveExSvr, nhiCfgServer, nhiMsgServer and nhiDbServer can be found in the Bafs\escalated tickets\56000\56162\InHouse directory. A summary of these files follows.
FROM nhiServer debug (-Dall flag used):
[s,cba ] Detected program 8077 died, status = 11
[s,cba ] Invoking SIGCHLD callback for pid = 8077.
[t,tb ] E:SpsExecuteList::programDiedCb (pid)
[d,sps ] Program 8077 died at 11/20/01 11:43:24
[t,tb ] E:SpsProgramDescription::SpsProgramDescription (pid)
[t,tb ] X:SpsProgramDescription::SpsProgramDescription (pid)
[D,sps ] Dead program:
[D,sps ] file = /ehealth/ehealth506/bin/sys/nhiLiveExSvr
[D,sps ] pid = 8077
[D,sps ] started = 11/20/01 11:41:32
[D,sps ] args = -Dall
[D,sps ] restart = all
[D,sps ] wait = 5
FROM nhiLiveExSvr debug (-Dall flag used), LAST QUERY RUN:
[d,du ] Initializing query : 'select a.profile_id, a.alarm_id, a.attribute_type, a.attribute_oper, a.value from nh_alarm_attribute a where a.profile_id = 1000002 order by alarm_id'
[Z,du ] sqlca.sqlcode: 0
[Z,du ] rows: 0
[Z,du ] sqlca.sqlcode: 0
[Z,du ] rows: 0
[Z,du ] sqlca.sqlcode: 0
[Z,du ] rows: 0
[d,du ] Begin transaction level 1
[d,du ] Initialized query for cursor name nhQuery30.
[Z,du ] sqlca.sqlcode: 100
[Z,du ] rows: 0
[Z,du ] sqlErrorCode: 100
[Z,du ] sqlErrorText:
[Z,du ] sqlErrorMsg : 100,
[d,du ] Ending query nhQuery30.
[Z,du ] sqlca.sqlcode: 0
[Z,du ] rows: 0
[d,du ] Ended query nhQuery30.
[d,du ] Committing database transaction ...
[d,du ] Committed.
[d,du ] End transaction level 1
- This query runs successfully during an sql session and returns 0 rows. NO MORE INFO IS IN THE FILE AFTER THIS QUERY.
FROM nhiMsgServer debug (-Dall flag used). The only error present in this file is:
[i,cba ] Processing standard application args ...
[i,cba ] Initializing base application library ...
[s,cba ] Initalizing pipe-based signal handling...
[s,cba ] Signal initalization complete.
[i,cba ] Initializing application ...
[d,lmgr] In lmInit
[d,lmgr] Poller Feature msg error = (0,0:2 "No such file or directory")
- nhiDbServer debug listed no relevant errors or failures.
- DEBUG from nhiCfgServer shows the following last transaction logged:
[Z,poller] Read new poller.cfg:
[Z,poller]
[s,cba ] Signal handler invoked for signal = 2, writing to pipe
- The Ingres error log presents no errors or other information for any transactions around the failure time.
Even though the customer is now successfully polling information with the older database, he wishes to reload this archive so that he has the 2 months of data missing from the older version.
12/3/2001 4:25:14 PM rtrei Jay, Jason, I believe ticket 19333 is caused by the fact that the nh_alarm_attribute table does not contain any rows with a profile_id of 1000002. The nhiLiveExSvr queries this table and dies immediately after the empty result is returned. (Jay, perhaps we should make nhiLiveExSvr more robust in a patch.) The profile_id is in many of the other associated tables, so it clearly expected the results to be in this table as well. I reviewed the code that upgrades these, and it looks good to me. (Jay can double-check if he likes.) I suspect this problem is a result of their being a beta site.
I don't think we support beta information becoming production; it is the nature of beta sites to have errors in their data due to the problems we solved and fixed. However, if this customer is just testing this version, I believe that if he runs the following commands things should work again. This removes his profiles, but as he only has 2, I don't think they will be too hard to recreate.
echo " delete from nh_alarm_rule where profile_id > 1000000; commit\g" | sql ehealth
Repeat the command using the following table names:
nh_alarm_threshold
nh_alarm_event_threshold
nh_alarm_attribute
nh_exc_profile
nh_exc_profile_assoc
12/5/2001 11:00:28 AM rtrei Jay responded that having the nh_alarm_attribute table empty should not have caused this on its own. So I retraced it, and the failure happens when it is trying to add the profile 1000002: some data somewhere is bad and causes RW to core. Reassigning this to Jay as he can determine next steps. Here is the email stream. A copy of the saved database is in /net/noway/export/noway2/nh50/56162. From my understanding of the problem ticket, this was done recently: they upgraded to 5.0.1 RTM on top of their beta software, then started having server crashes. They then loaded an older saved beta database and the crashes went away.
> -----Original Message-----
> From: Wolf, Jay
> Sent: Wednesday, December 05, 2001 10:13 AM
> To: Trei, Robin
> Cc: Mitchell, Richard; McAfee, Stephen; LaRosa, Tom
> Subject: RE: 19333
> Importance: High
>
> Robin,
>
> I'm sending this to Steve, Rich and Tom as well because it looks like a problem similar to ProbT17688, where SBC also had corrupted nh_alarm_event_threshold rows written to the database. At that point I couldn't deduce how they had gotten written that way, and based on timestamps I wrote it off to an early beta. My concern here is that we now have two of these, and that it causes LE to core (because there is a bad string), which brings down the eHealth server (nhiLiveExSvr = restart all).
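Robin's cleanup above ("repeat the command using the following table names") is mechanical enough to generate. A small Python sketch (cleanup_statements is a hypothetical helper; the table list and the profile_id > 1000000 cutoff come straight from the note):

```python
# Tables Robin listed as holding per-profile alarm/exception rows.
TABLES = [
    "nh_alarm_rule",
    "nh_alarm_threshold",
    "nh_alarm_event_threshold",
    "nh_alarm_attribute",
    "nh_exc_profile",
    "nh_exc_profile_assoc",
]

def cleanup_statements(tables, min_profile_id=1000000):
    """Emit one delete-and-commit command per table, removing all
    user-created profiles (profile_id above the system range)."""
    return [
        f"delete from {t} where profile_id > {min_profile_id}; commit\\g"
        for t in tables
    ]
```

Each emitted string matches the form of the echo'd command in the note and would be piped through sql in the same way.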
> The parallels between the two tickets are that nh_alarm_event_threshold is corrupted, that it looks like a copy of our system-delivered profile, and that the botched values in it are the same: alarm_seq_nmbr = -32624156 or close, event_type = "_E".
>
> Tom, we may want to look at the classes which copy alarm rules in Java and test whether we can copy from a system profile successfully.
>
> Jay
>
> It looks like corrupt data in the nh_alarm_event_threshold table. Hard to say how long it has been there. The only clues are that the profile was created on October 15 and no rules have been added or deleted since then to any profile in the system. It's a little disconcerting that it was as recent as October 15.
>
> Where is the database save at? I would like to find the nvr* file to take a look at what revision of the software they were at. Hopefully it was an earlier beta, but if they installed a recent beta since 10/15, we won't know.
>
> Here are the bad values in this row:
> profile_id = 1000002
> alarm_id = 1000012
> alarm_seq_nmbr = -32624156
> event_type = _8
> event_name = ID
> condition_type = 0
> timeout = 2
> timeout_units = -4263388
> ignore_clear_ = 4
> threshold_rat = 13511968
> rate_units_sp = 12
>
> Starting with alarm_seq_nmbr things go bad. There is nothing the LE Engine can do to defend against this garbage data. It turns out that the strings that are passed from the database screw up RogueWave strings.
>
> Don't know how this could have gotten contaminated other than by a bad nhWeb write to the database. Maybe we need to start adding checksums on the Java communication?
>
> Rich, I think we need to keep our eyes on this, and if we do get crashes early on in the LE Engine's startup, always query nh_alarm_event_threshold for profile_id > 999999 for suspicious rows.
>
> Jay
>
> -----Original Message-----
> From: Trei, Robin
> Sent: Tuesday, December 04, 2001 12:27 PM
> To: Wolf, Jay
> Subject: 19330
>
> Jay,
>
> I've looked further into the LiveExSvr crash. It is related to the data that is in the customer database. It seems to be crashing here:
> LexAlarmRuleLookup::addEvalAlarmRule
> LexEvalAlarmRule::LexEvalAlarmRule
> LexEvalAlarmRule::operator=(this = 0xb2e268, r = CLASS)
> LexEvalPolledAlarmCond::LexEvalPolledAlarmCond(this..., r=CLASS)
> LexEvalPolledAlarmCond::operator=(this..., r=CLASS)
> RWCString::operator=
> signal SEGV (...) in RWCString::operator= at line 512 in cstring.cpp
>
> I've just captured the failure, haven't yet stepped through it. Do you want me to continue, or do you want to take it from here? You may want to telnet into noway to look at the data. It is definitely somewhere in what is defined for profile_id 1000002.
>
> The db is in /export/noway2/nh50... nhuser is rtrei (PW: glasto1), db name is eHealth.
11/20/2001 4:38:05 PM rrick Problem: Customer gets "Lock Quota exceeded" about 2 minutes after bringing up the nhServer. Their health reports also then fail with the following error:
Error: Invalid group file 'Could not get modified time from nh_group table'.
Error: Invalid group file 'Could not get modified time from nh_group table'.
Error: Invalid group file 'Could not get modified time from nh_group table'.
Report failed.
Appending 50 lines of the Ingres errlog.log file. The following is happening on all 10 servers (9 of the 10 servers contain only stats elements; the 10th contains both stats and conv. elements). Error in errlog.log:
RTPCON02::[48964, 00000001]: Mon Nov 19 17:47:19 2001 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (su4.us5/00) Server -- Normal Startup.
RTPCON02::[48964, 0000001b]: Mon Nov 19 17:49:34 2001 E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. The lock list currently holds 1000 logical locks, and the maximum number of locks allowed is 1000. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit.
RTPCON02::[48964, 0000001b]: Mon Nov 19 17:49:34 2001 E_DM004B_LOCK_QUOTA_EXCEEDED Lock quota exceeded.
RTPCON02::[48964, 0000001b]: Mon Nov 19 17:50:12 2001 E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. The lock list currently holds 1000 logical locks, and the maximum number of locks allowed is 1000. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit.
RTPCON02::[48964, 0000001b]: Mon Nov 19 17:50:12 2001 E_DM004B_LOCK_QUOTA_EXCEEDED Lock quota exceeded.
NOTE: This was increased from 700 to 1000 in the Config.dat file.
11/27/2001 6:03:29 PM yzhang Created an issue with CA with the following question: We have a customer who was running into the problem of "Lock quota exceeded". I had them increase lock.per_tx_limit from 700 to 1000, but they still got the same error. Any suggestion about what to do next?
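When chasing these lock errors it can help to pull the lock counts straight out of errlog.log rather than reading the messages by eye. A hedged Python sketch (max_locks_seen is a hypothetical helper; the regex is matched to the E_DMA00D message text quoted above):

```python
import re

# Matches the count in "currently holds 1000 logical locks" from the
# E_DMA00D_TOO_MANY_LOG_LOCKS message in errlog.log.
HOLDS = re.compile(r"currently holds (\d+) logical locks")

def max_locks_seen(log_text):
    """Return the largest 'currently holds N logical locks' value found in
    the log text, or None if the message never appears."""
    counts = [int(m.group(1)) for m in HOLDS.finditer(log_text)]
    return max(counts) if counts else None
```

Comparing this number against the configured ii.*.rcp.lock.per_tx_limit shows immediately whether a raised limit is still being hit, which is exactly what happened when the limit went from 700 to 1000 and then 1400.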
11/29/2001 1:24:20 PM yzhang increased the parameter from 1000 to 1400 through cbf
11/30/2001 4:35:24 PM jnormandin From: Shaune Morley [mailto:smorley@concord.com] Sent: Friday, November 30, 2001 4:20 PM To: Normandin, Jason Subject: RE: call ticket # 56505 Lock issue Just increased the locks to 1400 on rtpcon00. As soon as I restarted nethealth I got an error that the locklist had been exceeded and that the lock list currently holds 1400 logical locks. Any idea what's going on? We had 700 locks under 4.8 and it worked fine. Shaune Morley
11/30/2001 6:16:41 PM yzhang Walter, I actually created an issue with CA for this problem. What they recommend is to keep increasing the tx_lock parameters; it looks like this does not work. The other option is to set readlock to nolock. From the C shell, logged in as ingres, the customer can issue the following command: setenv ing_set "set lockmode session where readlock = nolock" My guess is that some reports take a long time and hold many shared locks, so the number of available locks runs short. If you can find out what application is running when the lock exceeded error appears, then it is possible for us to find out what transaction causes the problem. Thanks Yulun
12/3/2001 9:53:55 AM jnormandin Yulun, These errors occur immediately after the server starts, and there are no scheduled jobs, reports, etc. running at the time of the errors. Is it still advisable to have them set readlock to nolock, or should we look deeper into the cause? -Jason
12/3/2001 4:11:05 PM yzhang Jason, Here is what we should do. If the customer's reboot does not make the same error disappear, check with the customer to see how they are doing on the reboot. 1) add the du and cdb flags for advanced nhiServer, as Robin suggested 2) immediately after seeing the error, find the transaction/query which causes the problem; come to my office, I will show you the procedure.
Yulun
12/4/2001 4:15:40 PM jnormandin - Customer has stated the following: "I can only reboot the servers after hours and yesterday I was working on something else and forgot. I will try to reboot one of them tonight."
12/5/2001 9:01:32 AM jnormandin Rebooting did not stop the errors. See note from customer: Rebooting did not clear the lock issue. The box came back online, and in the errorlog.log file it says that ingres finished Normal startup at 17:13; at 17:17 it ran out of locks, which would be sometime during eHealth startup. "The lock list currently holds 1400 logical locks, and the maximum number of locks allowed is 1400." It's almost like the counter is never clearing locks that are actually being cleared. These messages don't "seem" to impede the function of the database, but they are disconcerting.
12/5/2001 9:01:39 AM jnormandin .
12/5/2001 9:47:49 AM yzhang Jason, As I mentioned earlier, we need to do the following: 1) add the du and cdb flags for advanced nhiServer debugging, then stop and restart nhServer, and send the debug output file 2) immediately after seeing the error from starting the server, find the transaction/query which causes the problem; come to my office, I will show you the procedure. Don't send this request to the customer until you know the procedure. Yulun
12/6/2001 11:17:41 AM wburke called customer w/ procedure, ACB
12/7/2001 4:48:14 PM wburke -----Original Message----- From: Burke, Walter Sent: Friday, December 07, 2001 4:38 PM To: Morley, Shaune Subject: Ticket # 56505 Shaune, This procedure would be best undertaken during an eHealth maintenance window, as the lock errors will show.
1. as ingres user, source nethealthrc.csh
2. ipm - F1 to obtain : prompt - :End to escape current IPM UI window - :quit to escape IPM UI
3. Choose Lock_Info - Capture SnapShot if possible. - Note ID of Locks - select Lock and get more info. - Obtain Transaction ID - End
4.
Choose Server_List - Select Ingres - Select Session - Obtain SnapShot if possible - Determine the query which is locking the dB. - Should match Lock ID w/ Session ID. Sincerely,
12/10/2001 11:12:25 AM wburke Walked customer through debug steps. - needs to wait for maintenance window to capture lock info
12/10/2001 3:39:38 PM yzhang You mean the logical lock error will appear every time the customer runs ingstart? If this is the case then the problem will not be caused by our transaction.
12/10/2001 4:09:54 PM yzhang Talked to ProServ, and he told me that the logical lock error appears in the errlog.log every time they start nhServer or run nhReset. He said he will get us the advanced debug for nhiServer and nhiDbServer this week, as well as the transaction that causes the error. Yulun
12/10/2001 4:44:05 PM wburke Are we using Response or LiveExceptions, Trap, or Notifier on these servers? Thanks, Walter
12/10/2001 5:04:38 PM wburke Using Response on one server and LiveExceptions on two of them, I think, but all servers are licensed for just about everything. Here are the nhi processes running on most of the servers (I checked 3 of the 10): nhiServer start, nhiTrapServerCmu, nhiPoller, nhiPoller -dlg, nhiRespServer, nhiNotifierSvr, nhiMsgServer, nhiPoller -import, nhiArControl, nhiDbServer, nhiPoller -live, nhiCfgServer, nhiLiveExSvr
12/13/2001 10:30:40 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, December 13, 2001 10:20 AM To: Morley, Shaune Subject: Ticket # 56505 Shaune, The lock errors, given that they appear only during ingres startup, may be specifically related to the following new application servers introduced in release 5.0.2.
1. Set the NH_DBG_OUTPATH variable in nethealthrc.sh.usr to the following path: $NH_HOME/tmp
2. Edit the $NH_HOME/sys/startup.cfg as follows: - note it may be easier to make a copy of the orig.
nhiMsgServer arguments -Dm du:cdb
nhiDbServer arguments -Dm cdb:tb:du:dsvr -Df cditp
nhiLiveExSvr arguments -Dm cdb:tb:du:dsvr -Df cditp
nhiArControl arguments -Dm cdb:tb:du:dsvr -Df cditp
i.e.
program nhiMsgServer {
restart all # Requires complete server restart
wait 2
arguments "-Dm du:cdb"
}
3) stop and start nhServer
4) wait till lock errors occur.
5) stop server, revert startup.cfg
6) start server. Sincerely,
12/13/2001 1:38:39 PM wburke obtained: BAFS/56505/12.13.01
12/14/2001 9:14:37 AM yzhang Robin, Here is the information collected from the customer regarding the logical lock error. I checked with Walter and ProServ; the nhServer started properly, and nothing looks wrong on the customer side. The only thing they see is the error message about the logical lock limit being exceeded in the errlog whenever they start nhServer or run nhReset. I don't see any rollback in the two attached debug files. They did not get the nhiServer debug, because it is not in the start.cfg file. ProServ is still working on getting the transaction that causes the message. At the same time I am working with CA regarding why this is happening and how to avoid the message. Let me know your comments. Thanks Yulun
12/17/2001 9:46:03 AM yzhang Walter, The information you collected last time is not enough; we still need the debug for arController, and the transaction causing the logical lock error. Thanks Yulun
12/17/2001 1:57:47 PM rtrei Reviewed the info provided. The only thing that looks out of the ordinary (besides all the error msgs in the errlog.log) is that the dbServer tries to set ~11,000 elements to full_duplex. This is suspicious, as this was a problem discovered at the end of the release, and I'm wondering if the fix or the original problem could be connected with what this user is seeing. Although the trace file looks like it was successful, I have asked Walter to have the customer repeat. If they need to update 11,000 rows again, it will be a strong indication that this is the problem.
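The ~11,000 full_duplex rows mentioned above map directly onto the lock limit: the update touches one row per element inside a single transaction, so the per-transaction lock limit must be at least the row count. A hedged sketch of sizing the limit from such a count; the round-up-to-the-next-thousand headroom rule is our own choice for illustration, not from the ticket, which simply set the limit to roughly the row count.

```shell
#!/bin/sh
# Illustrative only: given the number of rows one transaction will touch
# (e.g. elements being flipped to full_duplex), suggest a per_tx_limit
# rounded up to the next thousand for headroom.
suggest_tx_limit() {
    count=$1
    echo $(( (count + 999) / 1000 * 1000 ))
}

suggest_tx_limit 11000   # prints 11000
suggest_tx_limit 10432   # prints 11000
```

The suggested value would then be written into `ii.<host>.rcp.lock.per_tx_limit` in config.dat, as the workaround entries below this point describe.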
12/19/2001 11:07:47 AM rtrei At this point, we believe the problem is with large DCI transaction requests. It looks like we are going to need to examine this and make some changes in the dbServer to handle large updates. Meanwhile, we may want to increase the lock quota in the config.dat.
Walter-- I've talked this through and we believe we will need to make a code change in the nhiDbServer. So we will be sending these customers a one-off as soon as we have made the change and tested it. Meanwhile, now that we understand the problem, we are wondering if it makes sense to explore increasing the lock quota in the config.dat as a possible workaround. I know that we increased it to 1000 with no results. I am wondering if we should try to increase it to 4000 or so. I am opening a ticket with CA about how large this could get and what trade-offs we would be making, so your choice is to go ahead and try with the customer or wait until you hear from me.
12/19/2001 2:54:31 PM wburke OK! Appears that we have a valid workaround:
echo "select count(*) from nh_element_aux where full_duplex = 1\g" | sql ehealth
obtain count. next:
cp $II_SYSTEM/ingres/files/config.dat config.dat.orig
vi config.dat
find ii.$HOSTNAME.rcp.lock.per_tx_limit: 700
set new value of ii.$HOSTNAME.rcp.lock.per_tx_limit: count
Ex: echo "select count(*) from nh_element_aux where full_duplex = 1\g" | sql ehealth returns 8000
set ii.$HOSTNAME.rcp.lock.per_tx_limit: 8000
12/19/2001 4:26:04 PM rtrei This ticket had to do with lock quota being exceeded when the server started up. I had traced this to being connected with full_duplex and, knowing that there was a full duplex resolution in the upcoming patch, thought that this would solve the problem. After discussing with Larry we have come to the conclusion that this problem will not go away with the patch and will require changes in the cdb layer.
As such we will need to send one-off executables to the customers experiencing this problem and then put the fix out in patch 2. However, as a workaround (until I can get back to write the code in January), we are going to have the customer raise the lock quota limit in Ingres. Walter has tried this with one customer, upping the limit from 700 to 11,000, and it has made the problem go away. Walter will be letting the rest of tech support know about this workaround. I have checked with CA and there are few direct tradeoffs to increasing this parameter significantly. (We won't run out of memory or anything.) It does increase the transaction time, which puts us at risk for performance issues and other problems (deadlocks), so this is only a short term solution. Fortunately, I believe we are only seeing this with large poller.cfg sync ups and very large discovers.
1/3/2002 9:21:12 AM rtrei The nh_element_aux table was not getting set for table level locking. Also modified the commit to reset back to session default locking. About to unit test the code. If successful, can send new nhiDbServer and cdbLib to customers for one-off testing.
1/3/2002 8:09:48 PM rtrei requested customer database from tech support for testing.
1/4/2002 10:55:54 AM wburke -----Original Message----- From: Burke, Walter Sent: Friday, January 04, 2002 10:44 AM To: 'bbyerly@uslec.com' Subject: Ticket # 57370 Brian, Engineering has requested a copy of your dB for in-house testing of a code fix for the logical lock problem. You may tar and ftp $NH_HOME/db/save/daily.tdb to ftp.concord.com (login: anonymous, pass: ident, bin, put). A daily save would suffice. Please let me know if this is possible.
1/4/2002 1:37:47 PM wburke obtained bafs\57000\57370\db
1/4/2002 1:38:05 PM wburke obtained bafs\57000\57370\db
1/7/2002 1:28:56 PM yzhang Robin, I noticed you made a code change for this one, and now we need to test the change with the customer's db. We already have the customer's db. Do you want me to do the test?
If so, I need access to your view for nhiDbServer and cdblib. Also I need to set the lock limit to 700 for testing. Is this correct? Thanks Yulun
1/7/2002 6:03:19 PM yzhang in house testing
1/10/2002 4:48:27 PM yzhang Walter, Can you grab nhiDbServer from /export/sulfur3/nh50_s_m/bin/sys and ship it to the customer. This was built from the code change made by Robin. I tested with the customer's db and did not see the logical lock error. Thanks Yulun
1/11/2002 12:53:36 PM wburke sent the following: Development has created a one-off fix, which will later be patched into the revision. In order to put said fix in place please follow these steps:
as $NH_USER: cd $NH_HOME, source nethealthrc.csh
nhServer stop
su ingres
ingstop
cd $II_CONFIG
edit config.dat; change ii.$HOSTNAME.rcp.lock.per_tx_limit: 700
su $NH_USER
cd $NH_HOME/bin/sys
cp nhiDbServer nhiDbServer.orig
ftp ftp.concord.com (login: anonymous, pass: ident), cd outgoing, bin, get 19342_nhiDbServer nhiDbServer
make sure the permissions and ownership of nhiDbServer MATCH nhiDbServer.orig.
su ingres
ingstart
su $NH_USER
nhServer start. Everything should come up roses. Sincerely,
1/14/2002 9:59:57 AM wburke -----Original Message----- From: Shaune Morley [mailto:smorley@concord.com] Sent: Monday, January 14, 2002 9:06 AM To: Burke, Walter Subject: RE: Ticket # 56505 - DB Logical Locks Applied the one-off to rtpcon02, and the lock quota was exceeded on eHealth startup. Shaune Morley
1/14/2002 3:29:14 PM yzhang Walter, Last time I forgot to provide you with libCciWscDb.so for running the new nhiDbServer. Can you grab libCciWscDb.so from ~yzhang/remedy/19342 (by ftp; note this is binary) and have the customer replace the original one under $NH_HOME/lib with it (back up the original), and use the nhiDbServer you shipped two days ago. Change the parameter for logical locks per transaction to 700. Then recycle ingres and start the console, and see if the same error appears.
Thanks Yulun
1/15/2002 10:31:37 AM wburke -----Original Message----- From: Shaune Morley [mailto:smorley@concord.com] Sent: Tuesday, January 15, 2002 10:21 AM To: Burke, Walter Subject: RE: Ticket # 56505 - DB Logical Locks Ok, put in the lib*.so online and it all looks good. I only have this on rtpcon02 right now, and will monitor it.
1/15/2002 10:43:49 AM apier De-escalated. One-off fixed the problem.
1/18/2002 5:49:17 PM yzhang checked in the code: the change was to set the nh_element_aux table to table level locking. Also modified the commit to reset back to session default locking.
11/20/2001 5:21:01 PM jpoblete Customer: ATT Customer is trying to perform a DB save in ASCII; it fails with the following error: Fatal Internal Error: Ok. (none/) Below is the whole save log: Begin processing (11/19/2001 13:33:48). (dbu/DbuSaveDbApp::run) Copying relevant files (11/19/2001 13:33:49). (dbu/DbuSaveDbApp::run) Unloading the data into the files, in directory: '/logs/con_ascii.tdb/'. . . Unloading table nh_active_alarm_history . . . Unloading table nh_active_exc_history . . . Unloading table nh_alarm_history . . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_alarm_subject_history . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_calendar . . . Unloading table nh_calendar_range . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table nh_exc_subject_history . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_exc_history . . .
Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_le_global_pref . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . . Unloading table nh_system_log . . . Unloading table nh_step . . . Unloading table nh_schema_version . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_protocol . . . Unloading table nh_protocol_type . . . Unloading table nh_rpt_config . . . Unloading table nh_rlp_plan . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_analysis . . . Unloading table nh_subject . . . Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . Unloading table nh_var_units . . . Unloading the sample data . . . Fatal Internal Error: Ok. (none/)
Advanced logging for nhiSaveDb is in the Call Ticket Directory\Nov20
11/21/2001 9:41:03 AM jpoblete DbCollect.tar has been collected from the customer; it's in the Call Ticket directory\Nov20
11/21/2001 10:21:09 AM yzhang Jose, find out if there is a physical file for this stats0 table, which causes the save problem: echo "select * from iifile_info where table_name = 'nh_stas0_1005940799'\g" | sql nethealth Recycle ingres, then remove the entries from nh_rlp_boundary for nh_stas0_1005940799, and do the ascii save again. Yulun
11/28/2001 1:05:10 PM jpoblete Yulun, After following your instructions, the DB save worked fine.
Please close. -JMP
12/3/2001 10:48:43 AM yzhang problem solved
11/21/2001 2:53:39 PM wburke $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (11/21/2001 12:05:07 PM). Error: Append to table nh_dlg1b_1005800399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 24 rows not copied because duplicate key detected.) The reason for this ticket is that this is the 6th ticket for this customer with this problem.
-----Original Message----- From: Burke, Walter Sent: Tuesday, November 20, 2001 5:51 PM To: Zhang, Yulun Subject: FW: Ticket # 56608 - dlg1 rollup failure Yulun, Bear Stearns has another dialogRollup failure. The following tickets have happened on this since May '01: 48441 49427 53682 54961 55801 56608
con-ta1% nhiDialogRollup -now 11/26/01 Begin processing (11/20/2001 05:16:30 PM). Error: Append to table nh_dlg1b_1005800399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 24 rows not copied because duplicate key detected. ).
11/21/2001 4:07:38 PM yzhang get the following:
1) echo "select table_name, num_rows, create_date from iitables order by table_name\g" | sql nethealth > table.out
2) echo " copy table nh_dlg0_1005800399() into 'nh_dlg0_1005800399.dat'\g" | sql nethealth
3) echo " copy table nh_dlg1b_1005800399() into 'nh_dlg1b_1005800399.dat'\g" | sql nethealth
after collecting the above:
4) copy table nh_dlg1b_1005800399() from 'nh_dlg0_1005800399.dat'\g" | sql nethealth > append.out
Thanks Yulun
11/21/2001 5:13:26 PM wburke -----Original Message----- From: Ryan, Michael (Exchange) [mailto:michael.ryan@bear.com] Sent: Wednesday, November 21, 2001 4:29 PM To: 'Burke, Walter'; 'michael.ryan@bear.com' Subject: RE: 19365/55608 (escalated) Step 1) <> BAFS/56608/11.21.01/table.out Step 2) con-ta1% echo " copy table nh_dlg0_1005800399() into 'nh_dlg0_1005800399.dat'\g" |sql nethealth INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc.
Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed Nov 21 16:21:51 2001 continue * Executing . . . (2400 rows) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed Nov 21 16:21:52 2001
Step 3) con-ta1% echo " copy table nh_dlg1b_1005800399() into 'nh_dlg1b_1005800399.dat'g" |sql nethealth INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed Nov 21 16:23:14 2001 continue * Executing . . . (17852 rows) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed Nov 21 16:23:15 2001
Step 4) ***someone forgot the.... echo " in the beginning of the command... con-ta1% more append.out INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Wed Nov 21 16:24:59 2001 continue * Executing . . . E_CO003F COPY: Warning: 1663 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 1337 rows successfully copied. (1337 rows) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Wed Nov 21 16:25:04 2001
11/21/2001 5:40:53 PM wburke con-ta1% pwd /con0R3/health con-ta1% nhiDialogRollup Begin processing (11/21/2001 04:57:37 PM). Error: Append to table nh_dlg1b_1005800399 failed, see the Ingres error log file for more information (E_CO003F COPY: Warning: 24 rows not copied because duplicate key detected. ).
11/26/2001 10:49:56 AM yzhang The following tables need to be dropped and the entries in nh_rlp_boundary need to be updated. Please modify drop_dlg.sh in ~yzhang/scripts, test the script, and send me the script prior to sending it to the customer.
Don, Walter is out today. Can you have somebody take over? Jason or Jose can be considered.
nh_dlg0_1005728399 nh_dlg0_1005742799 nh_dlg0_1005757199 nh_dlg0_1005771599 nh_dlg0_1005785999 nh_dlg0_1005800399 nh_dlg0_1005814799 nh_dlg0_1005829199 nh_dlg0_1005843599 nh_dlg0_1006347599 nh_dlg0_1006390799 nh_dlg0_1006405199 nh_dlg0_1006419599
11/26/2001 11:38:24 AM cestep Sent the script. Waiting for results.
11/29/2001 10:51:29 AM ekarten Now assigned to Eric K. On 11/26, before the re-assignment, I sent Colin the following email: Yulun asked me to help with this. I looked at the system logs and I see that several polls are kicking off late. I need to know what the poll interval is for this customer, how many probes they are polling, how many are over frame relay, and print perf data for three poll intervals. My theory is that data is being polled multiple times because of the timeouts.
12/5/2001 4:24:02 PM don database is on ftp.concord.com/incoming/56608.tdb.tar
12/6/2001 2:19:18 PM ekarten Db is corrupt. Please try to retrieve it again.
12/6/2001 2:28:37 PM cestep requested the database again.
12/7/2001 12:00:15 PM cestep Received the new tar file and extracted it successfully. The tar file and the folder are on BAFS, under ticket #56608/12.7.01
12/10/2001 2:48:56 PM ekarten Unable to load the dB. Possibly missing components. Asked Colin to try.
12/12/2001 3:38:57 PM cestep I was able to load the new database. It's on BAFS, under ticket #56608/12.12.01; the tar file is 1212-56608.tdb.tar. Changing to assigned.
12/13/2001 10:36:38 AM ekarten This dB is clean, too clean. Rollups are not failing. Please ask the customer to save the dB when a rollup fails. You'll need to load the dB when you get it and then run a rollup. It must fail or I can't do anything with it.
12/13/2001 10:36:55 AM ekarten This dB is clean, too clean. Rollups are not failing. Please ask the customer to save the dB when a rollup fails. You'll need to load the dB when you get it and then run a rollup.
It must fail or I can't do anything with it.
12/14/2001 9:58:59 AM cestep Got another server with the same problem. Received a database save from that server, but was unable to load it successfully. Waiting for another one; the customer is testing it before sending. I will load it here and test to be sure it fails this time.
12/21/2001 2:07:38 PM schapman I am going to de-escalate this issue until we receive the database from the customer.
1/22/2002 3:45:43 PM ekarten It has been a month since requesting further info. I am closing this bug.
11/21/2001 7:11:23 PM wburke Conversation rollups fail to complete. They just run forever... Wants to save as much data as possible.
- always had a problem with way too many nodes.
- currently unable to save database.
- check aaaabaap for size. (nh_element)
- ls -l aaaab* - 287744000
- 2 gb limit is not the problem.
- -rwx------ 1 ingres staff 287744000 Nov 21 11:16 aaaabaap.t00
- ran nhDbStatus: - 153 probes - 2,246,199 nodes
set NH_UNREF_NODE_LIMIT=4, leave NH_POLL_DLG_BPM=1000
- Settings: As Poller Conversations = 5days, 4hr samples = 1wk, 1day sample = 2wks, 1wk sample = 4wks. Rollup Top Conversations: default settings. adv. logging for nhiRollupDb on BAFS/55252/11.21.01
11/21/2001 7:14:40 PM wburke Need to collect $NH_HOME/tmp/dbCollect.tar This is obtained by running $NH_HOME/bin/nhCollectCustData
11/26/2001 9:39:30 AM yzhang waiting for nhCollectCustData
11/27/2001 8:37:45 AM cestep We have received the CollectCustData. It's on BAFS, under ticket #55252/11.27.01
12/3/2001 11:39:30 AM yzhang The problem is that the regular db save failed on the nh_element table. Have the customer try the following:
1) echo "select file_name from iifile_info where table_name = 'nh_element'\g" | sql nethealth
2) find the size of this file_name in the physical database
3) see if they can just save the nh_element table into a file: echo "copy table nh_element() into 'nh_element.dat'\g" | sql nethealth. If the copy command fails, get the error message.
Thanks Yulun
12/3/2001 11:56:31 AM cestep E-mailed procedure to customer to obtain requested information. Awaiting reply.
12/4/2001 4:42:29 PM wburke fdavux2% echo "copy table nh_element () into 'nh_element.dat' \g" | sql nethealth INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version II 2.0/9808 (su4.us5/00) login Tue Dec 4 13:26:04 2001 continue * Executing . . . E_CO003B COPY: Error writing to Copy File while processing row 2153946. E_CO0029 COPY: Copy terminated abnormally. 2153945 rows successfully copied. continue * Ingres Version II 2.0/9808 (su4.us5/00) logout Tue Dec 4 13:29:14 2001 fdavux2%
12/5/2001 1:51:21 PM wburke Rest of info sent to Yulun, BAFS/55252
12/10/2001 11:38:03 AM wburke -----Original Message----- From: Burke, Walter Sent: Monday, December 10, 2001 11:28 AM To: Zhang, Yulun Subject: 19376 Anything?
12/10/2001 6:22:31 PM wburke -----Original Message----- From: Zhang, Yulun Sent: Monday, December 10, 2001 6:07 PM To: Burke, Walter Subject: RE: 19376 I have been working with the customer this afternoon. His ASCII save of the nh_element table has now succeeded. After he backs up the nh_element table, he can do an ASCII load of the nh_element table.
12/12/2001 11:37:18 AM yzhang I guess this is because the drop of nh_element has not been committed. Drop the nh_element table again, then commit. If this succeeds, run the copy in. Be sure not to remove anything from the nh_home directory.
12/16/2001 3:40:39 PM yzhang It is 19376. Can you send me the error message and echo "help\g" | sql nethealth. I want to make sure that the problem is that the rollup was trying to insert a duplicate into a stats1 table and that there is no duplicate in any of the stats0 tables, before I pass this to Dave. Thanks Yulun
12/16/2001 5:48:46 PM yzhang Dave, This is a rollup failure because the db rollup is trying to insert a duplicate into a stats1 table. There is no duplicate in any of the stats0 tables before the db rollup.
I think this one is very similar to 11800, which you fixed last time. Walter has the database. I re-assign this one to you as we talked last week. Thanks Yulun
12/17/2001 10:46:29 AM dshepard Sent to Yulun: The first part of this ticket is all about Dialog rollups, not Stats rollups. Which is correct? I will change the Short Description if it is really stats tables. Where is the data?
12/17/2001 11:21:36 AM yzhang Walter and Shelden can put the following into a file as a script after making the change I recommended, then run the script. After running the script, run the db rollup; let me know when you see the duplicate message.
-------------------------------------------------------------------------
#!/bin/sh
sql -u$NH_USER $NH_RDBMS_NAME << EOF | grep -v continue | grep -v '\* Executing'
create table nh_stats1_992059199 (change this table name to one of the stats1 tables in the db) as select * from nh_stats1_992145599 where 1=2 ;\g
commit;\g
CREATE INDEX nh_stats1_992059199_ix1 on nh_stats1_992059199 (sample_time, element_id) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100; \g
commit; \g
CREATE INDEX nh_stats1_992059199_ix2 on nh_stats1_992059199 (element_id, sample_time) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100;\g
commit; \g
EOF
12/18/2001 10:07:02 AM wburke All Yulun did was a copy out/in of nh_element. Customer called this morning to say his $II_SYSTEM/data/default/nethealth directory was missing...... AE may have blown it away. Calling customer.
12/18/2001 10:16:51 AM yzhang Customer called this morning, saying he could not start nhServer. I had him do sql nethealth; he got "no database with this name exists". He checked that there is no physical nethealth database. I don't know why his database got lost. A few days ago I had him do a copy out and copy in for the nh_element table. He told me the copy succeeded. This copy out and copy in should not cause the database to get lost.
Walter is now working with the customer.
12/26/2001 10:00:47 AM wburke -----Original Message----- From: Keith.Stuart@fluor.com [mailto:Keith.Stuart@fluor.com] Sent: Friday, December 21, 2001 7:59 PM To: Burke, Walter Subject: RE: Ticket # 55252 - TA issue. Walter, Hope you have a good Holiday......Thanks for all the input you have put into this problem. As anticipated the script ended with an error....see the following text for details. Thank you, Keith Stuart NMS FSS 949.609.9795
Ingres Version II 2.0/9808 (su4.us5/00) logout Fri Dec 21 17:29:12 2001 + echo copy table nh_element () from '/data01/NH/db/save/save.tdb/smt_b452'; commit\g + sql nethealth INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version II 2.0/9808 (su4.us5/00) login Fri Dec 21 17:29:13 2001 continue * Executing . . . E_QE007D Error trying to put a record. (Fri Dec 21 18:54:27 2001) E_US16D2 COPY: Copy has been aborted. (Fri Dec 21 18:54:27 2001) ________________________________________________________ Unable to copy the nh_element table back into the dB. At this point there is nothing more we can do.
1/15/2002 4:23:33 PM yzhang I noticed from the description that you still have trouble copying the nh_element table back into the database; the error is "error on putting record". Let's work on this problem again. Here are my questions for either of you: 1) what is the current status 2) is this a transaction log 2GB problem? 3) what is the message in errlog.log 4) do you have a disk space shortage problem 5) how many elements are in the nh_element table Thanks Yulun
2/6/2002 2:41:58 PM don customer restored older Db. Call ticket closed; closing bug ticket.
11/26/2001 9:11:13 AM beta program PWC: PriceWaterhouse Coopers Don Mount donald.d.mount@us.pwcglobal.com 813-348-7252 Conversation Rollup Failure every 4 hours. Conversations_Rollup.100001.log ----- Job started by User at `11/26/2001 04:05:03 AM`.
----- ----- $NH_HOME/bin/sys/nhiDialogRollup -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing 11/26/2001 04:05:04 AM. Error: Unable to execute `CREATE TABLE nh_dlg1b_1006232399(sample_time INTEGER NOT NULL, nap_id INTEGER NOT NULL, proto_id INTEGER NOT NULL, delta_time INTEGER NOT NULL, bytes NUMBER, packets NUMBER) TABLESPACE NH_DATA06 PCTFREE 1 STORAGE (initial 397725600 NEXT 39772560 minextents 1 maxextents UNLIMITED pctincrease 1) ` (ORA-01237: cannot extend datafile 12 ORA-01110: data file 12: `/opt/co). ----- Scheduled Job ended at `11/26/2001 04:05:29 AM`. -----
11/27/2001 4:07:30 PM mfintonis From the error, it looks like one of the tablespaces ran out of room. I'd like to try out our new nhCollectCustData script (which is going out in beta 2). Please run the script as nh_user with your nethealthrc.csh file sourced: nhCollectCustData.sh It will create a tar file which can be ftp'd to our site. In addition, please do a df -k > disk.out so that we can match the problem tablespace to the disk it is on. -Robin
11/27/2001 5:18:48 PM wzingher Customer ran out of disk space. Marking closed. The disk space issue has been bulletproofed.
11/27/2001 5:19:32 PM wzingher Additional info: the tablespace was autoextending, and that has been fixed so that it will not occur - fixes in dbmaint.
1/21/2002 4:21:46 PM beta program marking Nobug since this was actually caused by not having enough disk space
2/26/2002 10:40:49 AM Betaprogram customer verified
11/26/2001 3:36:03 PM rkeville Customer has SPVD installed; the report front end is not working. (Rev 5.0 Beta6)
- Tom Manes was flown out to work on this issue; he discovered problems with the database.
- nhServer dies, seeing weird message in log file:
- "November 26, 2001 03:12:10 PM Fatal Internal Error nhiCfgServer Pgm nhiCfgServer: Call 'cdbFillElements' to database API failed.
dbs/DbsMsgHandler::getElementsCCb)" - Ingres reports the following messages: - REHJAO01::[42217, 00006218]: Mon Nov 26 10:48:23 2001 E_DM93A7_BAD_FILE_PAGE_ADDR Page 15608 in table , owner: $ingres, database: nethealth, has an incorrect page number: 0. Other page fields: page_stat 00000000, page_log_address (00000000,00000000), page_tran_id (0000000000000000). Corrupted page cannot be read into the server cache. REHJAO01::[42217, 000063aa]: Mon Nov 26 10:48:23 2001 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't m. - Destroyed nethealth dbase, created nethealth dbase, init 6 - nh appears to be back up and running. - Errors occurred during creation of a view in the database. - Mon Nov 26 17:58:35 2001 : createViews : error : failed to drop nhv_element_nameE_US125C Deadlock detected, your single or multi-query transaction has while: drop view nhv_element_name \g Mon Nov 26 17:58:35 2001 : createViews : error : failed to create nhv_element_nameE_US125C Deadlock detected, your single or multi-query transaction has while: drop view nhv_element_name \g - Requested the following: - install log - errlog.log - eHealth syslog file - nhCollectCustdata - We will need a pointer to the code or scripts that are causing this problem. - List the timezones and the operating systems these issues are happening in. - core files. ############################################################### 11/27/2001 9:46:42 AM rtrei This is probably going to be a no problem, awaiting confirmation with TS Bob, It looks like we went from 0 to engineering a little too quickly on this ticket. All of the eHealth services were supposed to have been disabled as a part of the SPV-D implementation; it looks like they were missed and caused the locking issue. I've disabled everything except the message and db servers and it looks like the cluster update is running fine now. I'll verify things are still ok in the morning and follow up with you to close this ticket. Thanks, proserv owes you a beer! 
Tom 11/28/2001 10:22:16 AM rkeville -----Original Message----- From: Manes, Thomas Sent: Wednesday, November 28, 2001 12:28 AM To: Keville, Bob Subject: RE: Equant RFE Issues Bob, I think we are ok, if you want to close the ticket. Thanks again, Tom ################################################### 11/28/2001 12:13:30 PM mfintonis > From: Fintonis, Melissa > Sent: Wednesday, November 28, 2001 11:29 AM > To: Trei, Robin; Keville, Bob > Subject: 19394 > > so can the problem ticket be closed as well or is there still > an issue we need to fix? > thanks! -----Original Message----- From: Trei, Robin Sent: Wednesday, November 28, 2001 11:37 AM To: Fintonis, Melissa Subject: RE: 19394 close it 11/27/2001 5:22:47 PM rrick Problem: Customer claims after executing one of the help buttons on the poller screen and then executing a Conversational Rollup they get the following error when executing an nhDbStatus from the GUI: Unable to execute 'MODIFY nh_dlg0_1005857999 TO BTREE UNIQUE ON sample_time, nap_id, dlg_src_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_RD002A Deadlock has occurred on system catalogs (Thu Nov 15 15:22:57 2001) ). Conversational Rollup runs fine. An nhDbStatus from the command line also produces a deadlock error, as well. This has happened before with this customer....different table. The way it was corrected was to bump up stack size to double the default, and provide a new reporting script. Please reference ticket #18573 for additional details. 
All new files located in bafs/escalated tickets/56000/56894/11.27.01 11/28/2001 3:02:04 PM rrick -----Original Message----- From: Trei, Robin Sent: Wednesday, November 28, 2001 10:00 AM To: Rick, Russell Subject: 19436 Tuesday, November 27, 2001 5:22:47 PM rrick Problem: Customer claims after executing one of the help buttons on the poller screen and then executing a Conversational Rollup they get the following error when executing an nhDbStatus from the GUI: Unable to execute 'MODIFY nh_dlg0_1005857999 TO BTREE UNIQUE ON sample_time, nap_id, dlg_src_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_RD002A Deadlock has occurred on system catalogs (Thu Nov 15 15:22:57 2001) ). I'm not sure what you mean by the above. Is this a repeatable set of circumstances, or is it just the steps they remember they did when this happened? Conversational Rollup runs fine. You should make the nh_dlg0 table a btree if it hasn't already rolled up. An nhDbStatus from the command line also produces a deadlock error, as well. This has happened before with this customer....different table. The way it was corrected was to bump up stack size to double the default, and provide a new reporting script. Please reference ticket #18573 for additional details. All new files located in bafs/escalated tickets/56000/56894/11.27.01 Talked with Bob Keville regarding an advisory we did on nhDbStatus from the command line. In addition, we will be putting a patched nhiReport out with the next patch which will help in this area. I will be discussing this with Jay to determine what further work we need to do. -----Original Message----- From: Trei, Robin Sent: Wednesday, November 28, 2001 2:25 PM To: Rick, Russell Cc: Keville, Bob Subject: RE: 19436 that is a weird one indeed. Can someone try to reproduce this in house (the help button part I mean.) Please check his conv rollup log to be sure it is completing successfully. 
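The deadlocks above come from concurrent jobs issuing catalog-touching DDL (MODIFY ... TO BTREE from the rollup vs. the same checks from nhDbStatus), and the engineering plan later in this ticket centers on a cdbGetDdlLock() call that serializes DDL. A minimal sketch of that idea, assuming a single process-wide lock (hypothetical names; the real fix is in C++ and would need a cross-process, database-level lock):

```python
import threading

# Assumed shape of the fix sketched in this ticket: every job that issues
# schema-changing SQL (MODIFY / CREATE INDEX) first takes one shared lock,
# so two jobs can never hold catalog resources in opposite orders.
ddl_lock = threading.Lock()

def run_ddl(execute, statement):
    """Serialize all DDL through one lock (stand-in for cdbGetDdlLock())."""
    with ddl_lock:
        execute(statement)

# Illustrative stand-in for a DB handle: just record what was "executed".
executed = []
workers = [
    threading.Thread(target=run_ddl, args=(executed.append,
        "MODIFY nh_dlg0 TO BTREE UNIQUE ON sample_time, nap_id")),
    threading.Thread(target=run_ddl, args=(executed.append,
        "CREATE UNIQUE INDEX nh_stats1_ix1 ON nh_stats1 (sample_time, element_id)")),
]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(len(executed))  # -> 2: both statements ran, one at a time
```

The point of the design is ordering, not speed: DDL is rare, so funneling it through one lock costs little while removing the lock-order cycles that produce E_RD002A.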
12/3/2001 4:41:52 PM mwickham -----Original Message----- From: Rick, Russell Sent: Monday, December 03, 2001 04:29 PM To: ts_esc_leads Subject: 56894 Please escalate per Yulun. 12/4/2001 8:02:13 PM rrick Spoke with Yi Jia: - Poller intervals are as follows: fast 1 minute medium 5 minutes slow 30 minutes conversations 15 minutes - Timeouts are default. 12/7/2001 10:20:19 AM rtrei Don is investigating if this really needs to be escalated. At some point, we will likely want to make changes in a patch, but first I need more info (as requested above) and I want to see how the changes we have already made (in 4.8 and 5.0) play out. 12/10/2001 11:40:15 AM mfintonis Don is investigating if this really needs to be escalated. At some point, we will likely want to make changes in a patch, but first I need more info (as requested above) and I want to see how the changes we have already made (in 4.8 and 5.0) play out. 12/10/2001 12:51:15 PM rrick Cannot reproduce in-house, so far. 12/11/2001 1:20:21 PM aholmansky This has nothing to do with Cisco WAN Manager IM as far as I can tell. 12/11/2001 7:25:01 PM rrick . 1/7/2002 8:57:04 PM rtrei I believe we are going to try and put the 5.0 deadlock changes in patch 2 or 3. That is about a week's worth of work. Currently setting this field for patch 3 because of that. 1/7/2002 8:58:01 PM rtrei Sorry, make that patch 10 or 11. Setting to patch 10 as that is my current choice. 2/8/2002 2:33:49 PM rdiebboll After reviewing the checkin history and the code history, here's what needs to be changed for this problem ticket. While the code changes should only take a day (max) to do, the testing will take several days given the list of executables affected (see list below). The testing elapsed time should span several weeks using a database setup with stats and probe elements, scheduled reports, and live exceptions. The Rollup schedule should be set up aggressively and the logs monitored closely. 
This is a big issue for me since I'll be doing other development in parallel and I won't be able to run two versions of eHealth on the same machine. It would be ideal to get another test machine to run this 4.8 longevity testing, or even better to get SQA to do it. CODE CHANGES: CdbTblElemAnalyze.sc Use cdbGetDdlLock () CdbTblRlpBoundary.sc Use cdbGetDdlLock () CdbTblRptConfig.sc Use cdbGetDdlLock () CdbTblsDlg.C Use cdbGetDdlLock () CdbTblsStats.C Use cdbGetDdlLock () cdbUtils.C: Define "Bool cdbGetDdlLock (DuDatabase* db)" cdb.H Add wscSitSerializeSqlDdl DbuCalcBaseline.C Use cdbGetDdlLock () LexDbEventQueue.C/.H Use cdbGetDdlLock () [Note: this was later removed - see piranha/39-40, and review piranha/beta5/2-4] wsc.C/.H Add NH_SERIALIZE_SQL_DDL CdbSystem.sc Investigate for use by nhiDbStatus 4.8 PATCH 'InstallTree.dat', and TEST TARGETS: * nhiCheckDupStats (need this too?) * nhiDataAnalysis * nhiPoller * nhiLiveExSvr ** nhiRollupDb ** nhiDialogRollup ** nhiCalcBaseline ** nhiIndexStats * Already in patch ** Must add to patch 2/22/2002 3:33:52 PM rkeville I have added ticket 57521 to this issue. 4/1/2002 4:42:14 PM yzhang Talked to Bob in support; customer will add catfish patch 9, where the fix for the stack dump and fixes for some other Ingres-related problems exist. Then watch to see if the same error comes back. 4/4/2002 2:30:39 PM yzhang One of the call tickets, 57521, associated with this problem ticket has been closed. 4/8/2002 2:27:09 PM yzhang This problem ticket is closed because all of the call tickets have been closed. 11/29/2001 4:34:49 PM knewman Scheduled Health reports did not import during nhLoadDb 11/29/2001 4:41:40 PM jpoblete Database Load did not show any error. Please advise on what you need to figure this out. 11/29/2001 5:46:07 PM yzhang Let's get some information: 1) save.log 2) load.log 3) nhSchedule -list > schedule.out (for current db) 4) check with customer regarding which scheduled job or jobs were missing, and how they noticed this. 
Thanks Yulun 12/4/2001 3:46:33 PM jpoblete The info requested is in the call ticket directory... 12/4/2001 3:49:04 PM jpoblete From customer: I have gone through and checked the web interface for the scheduled health reports that are supposed to be generated on a weekly basis. They were not generated on Sunday night. Twenty-two reports should have been generated for web access only. These reports are not sent to anyone. Have you been able to determine anything? I am getting to the point where I will need to go in and rebuild the reports that are missing. I just didn't want to make any changes until you had a chance to go over everything; however, by Thursday I will have to make the changes. One more thing ... The top of the job scheduler in the console does not show health reports as an option under "list jobs by application". I did have some licensing issues when I first reloaded the db onto this server, which I thought were solved with the new license keys that were sent out. Jim 12/5/2001 11:02:31 AM yzhang customer got report license, and ticket closed 12/4/2001 3:05:28 PM wburke From stats index log: Begin processing (04/12/2001 18:20:09). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 4 12:25:56 2001) From stats rollup log: Begin processing (03/12/2001 21:00:12). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Dec 3 15:06:50 2001) The indexDiag.out shows 1000 +/- duplicates in a lot of different tables. All tables shown range from 7 to 12 duplicates. Escalating per Db t-shooting doc per: "If the output from the nhiIndexDiag indicates only a few duplicates (less than 100) in the tables, run the cleanStats script. If the output contains thousands of duplicates, escalate the problem and send to engineering to consult before proceeding further." 
BAFS/57155 12/4/2001 3:06:39 PM wburke - Should I still run cleanStats? 12/4/2001 3:18:30 PM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, December 04, 2001 3:07 PM To: 'support@smtware.com' Subject: Ticket # 57155 - StatTable Duplicates Mario, Please run the attached script as $NH_USER in $NH_HOME. << File: cleanStats_mod.sh >> ./cleanStats_mod.sh clean then run nhiIndexDb then nhRollupDb Sincerely, ____________________________________________________________ Spoke with Cestep, and Yulun's script cleanStats_mod.sh should work. 12/5/2001 9:29:01 AM wburke -----Original Message----- From: Dirk Van de Walle [mailto:Dirk.VandeWalle@qconsulting.be] Sent: Wednesday, December 05, 2001 7:48 AM To: 'SMT - Support'; Support; 'wburke@concord.com' Cc: 'support@concord.com'; 'pernsten@concord.com'; 'SMT - Mario Robers' Subject: Ticket # 57155 - SMTI-00263 FW: - StatTable Duplicates Walter, I did all the requested actions. => problem solved at December 5th, 02:00 CET, you can close the case. Thanks for the quick response/solve of this case 12/10/2001 4:03:50 PM rrick Problem: Statistical Rollups failing with the following error: ----- Job started by Scheduler at '12/04/2001 12:18:52 AM'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (12/04/2001 12:18:53 AM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 4 01:33:15 2001) ). ----- Scheduled Job ended at '12/04/2001 12:33:24 AM'. ----- Cleaned up all duplicates. All stats tables contain indexes. Re-indexed the db. Copy of files and their *.tdb is located on bafs/escalated tickets/56000/56941. 12/10/2001 5:38:50 PM yzhang Here is what I know from the information you posted: 1) their db size is 10G, their free space is about 11G; looks like they still need more space. did they add space? 
2) you said you have run nhCleanStats_mod, but there is a table called nh_stats0_1007751599 that still contains duplicates. did you run the script with the clean argument? 3) they keep too many stats0 tables; check to see if they can set the length of rollup to default 4) get the following: echo "select sample_time, element_id, count (*) from nh_stats0_1007751599 group by sample_time, element_id having count (*) > 1\g" | sql nethealth >> stats0_dup.out1 echo "select element_id, sample_time, count (*) from nh_stats0_1007751599 group by element_id, sample_time having count (*) > 1\g" | sql nethealth >> stats0_dup.out2 Thanks Yulun 12/11/2001 12:52:24 PM rrick Sent Sql queries to Yulun. 12/12/2001 2:37:18 PM yzhang there is no duplicate in the existing stats0 table; the duplicate problem is caused by Db rollup. The only thing we can do in this case is to obtain the customer's db, and do the rollup test in house, but be sure to load the customer database onto the same platform, same nethealth version and same environment. Yulun 12/18/2001 1:17:55 PM jpoblete Reproduced the problem: Kyle is running eHealth 4.8 P07. Started the DB load... Monday, 12/17/2001 05:08:02 PM System Event Starting database load . . . Tuesday, 12/18/2001 04:56:25 AM System Event Database load complete. Load finished OK: Loading the Dac tables . . . Creating the Table Structures and Indices . . . Creating the Table Structures and Indices for sample tables . . . Granting the Privileges . . . Granting the Privileges on the sample tables . . . Load of database 'nh48' for user 'neth' completed successfully. End processing (12/18/2001 04:56:24 AM). Started a manual Rollup, which failed: neth@kyle% nhiRollupDb Begin processing (12/18/2001 10:46:12 AM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 18 11:01:48 2001) ). neth@kyle% This is the same error from the customer's log: ----- Job started by Scheduler at '12/04/2001 12:18:52 AM'. 
----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing (12/04/2001 12:18:53 AM). Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 4 01:33:15 2001)). ----- Scheduled Job ended at '12/04/2001 12:33:24 AM'. ----- The Database is loaded on kyle, login as neth/neth and choose option 3 upon login to set eHealth 4.8 settings. 12/18/2001 4:42:51 PM jpoblete I ran the rollup in debug mode: neth@kyle% nhiRollupDb -Dall -Dt & Begin processing (12/18/2001 03:33:18 PM). (dbu/DbuRlpDlgApp::run) Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 18 16:22:39 2001)). (dbu/DuTable::createIndex) Advanced logging file showed the failure while indexing table: nh_stats1_1005886799 12/18/01 16:22:19 [Z,du ] (dbExecSql): sqlCmd: CREATE UNIQUE INDEX nh_stats1_1005886799_ix1 ON nh_stats1_1005886799 (sample_time, element_id) WITH STRUCTURE = BTREE, FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100 12/18/01 16:22:39 [Z,du ] (dbExecSql): sqlca.sqlcode: -33000 12/18/01 16:22:39 [Z,du ] (dbExecSql): rows: 0 12/18/01 16:22:39 [Z,du ] sqlErrorCode: -33000 12/18/01 16:22:39 [Z,du ] sqlErrorText: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 18 16:22:39 2001) 12/18/01 16:22:39 [Z,du ] sqlErrorMsg : -33000, E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 18 16:22:39 2001) 12/18/01 16:22:39 [Z,du ] sqlca.sqlcode: -33000 12/18/01 16:22:39 [Z,du ] rows: 0 12/18/01 16:22:39 [d,du ] Cmd complete, SQL code = 1000000 12/18/01 16:22:39 [Z,du ] sqlca.sqlcode: -33000 12/18/01 16:22:39 [Z,du ] rows: 0 12/18/01 16:22:39 [Z,du ] sqlErrorCode: -33000 12/18/01 16:22:39 [Z,du ] sqlErrorText: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. 
(Tue Dec 18 16:22:39 2001) 12/18/01 16:22:39 [Z,du ] sqlErrorMsg : -33000, E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Tue Dec 18 16:22:39 2001) 12/18/01 16:22:39 [d,du ] End transaction level 3 12/18/01 16:22:39 [d,du ] End transaction level 2 12/18/01 16:22:39 [d,du ] Rolling back database transaction. 12/18/01 16:23:34 [d,du ] End transaction level 1 12/18/01 16:23:34 [d,cdb ] Stats read: 2578509 12/18/01 16:23:34 [d,cba ] Exit requested with status = 1 12/18/01 16:23:34 [d,cba ] Exiting ... 12/18/01 16:23:34 [d,du ] Disconnecting from db: nh48, user: neth, handle: [0xffbef4a8] ... 12/18/01 16:23:34 [d,du ] Disconnected. 1/8/2002 1:48:55 PM yzhang Can you request the database if we don't have one. Thanks Yulun 1/8/2002 1:55:05 PM jpoblete -----Original Message----- From: Poblete, Jose Sent: Tuesday, January 08, 2002 1:41 PM To: Zhang, Yulun Cc: Gray, Don Subject: RE: 19689 I have it... The Db is available from: ftp://neth:neth@kyle.concord.com/nh48/db/save/56941.tdb.tar 1/8/2002 2:03:00 PM yzhang please load customer db on solaries 4.8, then run stats rollup using following command: nhiRollupDb -Dall -d $NH_RDBMS_NAME -U $NH_USER then save the last 3000 line in a file, let me know when you done this. Thanks Yulun 1/8/2002 2:03:59 PM jpoblete Yulun, I have already done that, please look at the advanced logging file in the call ticket directory. 1/9/2002 11:20:32 AM jpoblete -----Original Message----- From: Poblete, Jose Sent: Wednesday, January 09, 2002 11:06 AM To: Zhang, Yulun Subject: 19689 Yulun, Here is the file you requested.... ftp://neth:neth@kyle.concord.com/nh48/log/advanced/ List the files, open: nhiRollupDb_dbg.txt kyle% $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME -Dall -Dt Begin processing (01/09/2002 09:57:41 AM). (dbu/DbuRlpDlgApp::run) Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Wed Jan 9 10:52:13 2002)). 
(dbu/DuTable::createIndex) My test box: kyle% nhShowRev Network Health version: 4.8.0 D06 - Patch Level: 07 Thank You. Sincerely, J. M. Martinez-Poblete, BSEE, CCNA Senior Support Engineer, Telco Accounts 1/14/2002 11:43:33 AM yzhang this is a repeat of 19708 1/15/2002 1:42:51 PM yzhang I need to resize the ingres transaction log to 1.5 G, and nhResizeTransaction on your system does not work properly, and I have no permission to ftp some files to $NH_HOME/bin. Can you transfer nhResizeTransactionLog from /export/sulfur1/nh48/bin to your system? Let me know when it is ready. Thanks Yulun 12/10/2001 4:35:03 PM rrick Problem: When loading the *.tdb received the following errors: Loading table nh_var_units . . . Loading the sample data . . . Loading the Dac tables . . . Creating the Table Structures and Indices . . . Non-Fatal database error on object: NH_ACTIVE_EXCEPTION_HISTORY 30-Nov-2001 20:20:26 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Fri Nov 30 20:20:26 2001) Non-Fatal database error on object: NH_ACTIVE_ALARM_HISTORY 30-Nov-2001 20:20:27 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Fri Nov 30 20:20:27 2001) Creating the Table Structures and Indices for sample tables . . . Granting the Privileges . . . Granting the Privileges on the sample tables . . . Load of database 'ehealth' for user 'rrick' completed with problems. I have the db in-house. They are running Win2000 Advanced Server. I reproduced this on Win2000 Professional. 12/10/2001 4:44:59 PM rtrei asked for db name, and location, info on how/when db was created (what version, etc.) 12/10/2001 6:47:43 PM rrick -----Original Message----- From: Trei, Robin Sent: Monday, December 10, 2001 4:34 PM To: Rick, Russell Subject: 19693 Rick-- Can you give me info about the database you were loading: what was its name, where is it, what version was it from, and who created it and when? 
Also, where did this problem occur in house? Can I telnet or map a drive to it? > -----Original Message----- > From: Rick, Russell > Sent: Monday, December 10, 2001 4:48 PM > To: Trei, Robin > Subject: RE: 19693 > > I received the db from Jim McComber. > It is a 501 db. > It is located on pc-rrick2k/E$/Jim > > > Problem occurred on pc-rrick2k/f$/eHealth501 > 12/17/2001 12:40:19 PM hbui Jay's suggestion and opinions: This is bad data in the beta cycle. The tables should not have had rows with the same exception_id. To go on, just delete the bad data. Jay will do further investigation. 12/17/2001 5:26:36 PM rrick -----Original Message----- From: Rick, Russell Sent: Monday, December 17, 2001 5:15 PM To: Bui, Ha Subject: 19693 Ha, Which data needs to be deleted? Do I remove the whole table or just the individual records? If you have any problems or issues, please feel free to contact support@concord.com, Attn: Russ Rick. My support hours are 11:30am - 8:00pm, EST. Regards, Russell K. Rick, Senior Support Engineer 12/18/2001 2:08:53 PM rrick -----Original Message----- From: Bui, Ha Sent: Monday, December 17, 2001 6:14 PM To: Rick, Russell Subject: RE: 19693 Rick, Jay suggested that we should only delete the bad data, not the whole tables. - Ha 12/21/2001 6:37:55 PM rrick -----Original Message----- From: Rick, Russell Sent: Tuesday, December 18, 2001 1:56 PM To: Bui, Ha Subject: RE: 19693 How do I go about doing that? First I must be able to identify the bad data and then remove it without corrupting the tables. Regards, - Russ -----Original Message----- From: Bui, Ha Sent: Wednesday, December 19, 2001 10:47 AM To: Rick, Russell Subject: RE: 19693 Oops, I forgot to tell you how :) Sorry. 
For the nh_active_exc_history, sql to the database, and run "delete from nh_active_exc_history where exception_id in ( select exception_id from nh_active_exc_history group by exception_id having count(*) > 1) " For the nh_active_alarm_history, do the following: "create table tem ( e integer, a integer); insert into tem select exception_id, alarm_id from nh_active_alarm_history group by exception_id, alarm_id having count(*) >1 ; delete from nh_active_alarm_history ori where exists ( select * from tem where tem.e = ori.exception_id and tem.alarm_id = ori.alarm_id ) " - Ha 12/21/2001 7:35:42 PM rrick -----Original Message----- From: Rick, Russell Sent: Friday, December 21, 2001 7:24 PM To: Bui, Ha Subject: 19693 Ha, Please check the test I ran with you SQL script. The boldface item in the output.....do you see a problem? Regards, Russ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Fri Dec 21 19:23:31 2001 continue * Executing . . . (4 rows) continue * Your SQL statement(s) have been committed. Ingres Version II 2.0/9808 (int.wnt/00) logout Fri Dec 21 19:23:31 2001 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Fri Dec 21 19:23:32 2001 continue * Executing . . . continue * Ingres Version II 2.0/9808 (int.wnt/00) logout Fri Dec 21 19:23:32 2001 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Fri Dec 21 19:23:34 2001 continue * * * Executing . . . (2 rows) E_US0834 line 1, Table 'tem' owned by 'rrick' does not contain column 'alarm_id'. 
(Fri Dec 21 19:23:34 2001) continue * Your SQL statement(s) have been committed. Ingres Version II 2.0/9808 (int.wnt/00) logout Fri Dec 21 19:23:34 2001 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres Microsoft Windows NT Version II 2.0/9808 (int.wnt/00) login Fri Dec 21 19:23:35 2001 continue * Executing . . . continue * Ingres Version II 2.0/9808 (int.wnt/00) logout Fri Dec 21 19:23:35 2001 ---------------------------------------------------------------------------------------------------------------------------------------------- -----Original Message----- From: Bui, Ha Sent: Wednesday, December 19, 2001 10:47 AM To: Rick, Russell Subject: RE: 19693 Oops, I forgot to tell you how :) Sorry. For the nh_active_exc_history, sql to the database, and run "delete from nh_active_exc_history where exception_id in ( select exception_id from nh_active_exc_history group by exception_id having count(*) > 1) " For the nh_active_alarm_history, do the following: "create table tem ( e integer, a integer); insert into tem select exception_id, alarm_id from nh_active_alarm_history group by exception_id, alarm_id having count(*) >1 ; delete from nh_active_alarm_history ori where exists ( select * from tem where tem.e = ori.exception_id and tem.alarm_id = ori.alarm_id ) " - Ha If you have any problems or the issues, please feel free to contact support@concord.com , Attn: Russ Rick. My support hours are 11:30am - 8:00pm, est. Regards, Russell K. Rick, Senior Support Engineer 12/26/2001 2:18:58 PM rrick -----Original Message----- From: Bui, Ha Sent: Wednesday, December 26, 2001 9:21 AM To: Rick, Russell Subject: RE: 19693 Sorry, It's my typo. 
It should be tem.a "create table tem ( e integer, a integer); insert into tem select exception_id, alarm_id from nh_active_alarm_history group by exception_id, alarm_id having count(*) >1 ; delete from nh_active_alarm_history ori where exists ( select * from tem where tem.e = ori.exception_id and tem.a = ori.alarm_id ) " Ha Tested again: It ran fine this time. Thanks again, - Russ 12/28/2001 12:33:51 PM rrick Tested: - I have the customer db. - After running the script to clean up the alarm history I tried to mass modify 1400+ elements and said ok. Then hit apply on the poller config screen and the server came down and tried to re-initialize. 1/8/2002 5:03:52 PM rrick -----Original Message----- From: Trei, Robin Sent: Tuesday, January 08, 2002 3:54 PM To: Rick, Russell Cc: Wolf, Jay Subject: FW: AR System Notification Rick-- If indeed this customer did an nhiLoadDb after running the 5.01 cert patch, then please escalate this ticket. I discovered this bug last week and discussed it at the patch tribunal. I will update a script to get the customer running, and then we will put the fix in the patch 2 release. -----Original Message----- From: Rick, Russell Sent: Tuesday, January 08, 2002 4:51 PM To: ts_mgrs Subject: FW: AR System Notification Please escalate per Robin......56794. 1/8/2002 5:06:25 PM drecchion From: Rick, Russell Sent: Tuesday, January 08, 2002 4:51 PM To: ts_mgrs Subject: FW: AR System Notification Please escalate per Robin......56794. 1/10/2002 7:15:39 PM rtrei Sending script to rrick to have him test and send on to customer. 1/10/2002 7:23:31 PM rtrei Included is a script which I think will get the customer back up and running. Try it on your database first. If it works for you, pass it on to the customer. The customer does not have to reload his database. I asked to have this ticket escalated for 2 reasons. One, the customer has been down for more than a month. 
Two, I thought there was a problem with running nhiLoadDb after applying the cert patch. It turns out that the files I was concerned about did not go into cert patch 1. This is good news. We should be able to fix the problem before our customers see it. This script fixes 3 of the 4 cases where nhiLiveEx crashes after upgrade to 5.0 and thus brings the system down. It probably makes sense to have customers run this when this symptom occurs: it does not harm the database, only removing data if it is actually bad data. If after running this script the symptom still remains (nhiLiveEx crashing eHealth), the ticket should be immediately escalated. (I know I'll live to regret saying that :>) 1/10/2002 7:34:26 PM rrick -----Original Message----- From: Rick, Russell Sent: Thursday, January 10, 2002 7:22 PM To: Trei, Robin; Gray, Don Subject: RE: 19693 Hi Robin, I apologize for not getting back to you earlier today. It has been extremely busy up here today. The customer was on eHealth 5.0.1 P1 D1 Win2000. The customer was also using the Advanced Server version of Win2000. I was on eHealth 5.0.1 P0 D0 using Win2000 Professional. Hope this helps, - Russ 1/14/2002 11:30:20 AM dbrooks put in field test per robin. 3/12/2002 11:47:41 AM rrick Customer upgraded to 5.0.2. De-escalating. Closing ticket. 12/10/2001 4:56:28 PM wburke Unloading table nh_active_alarm_history . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_active_alarm_history () INTO '/opt/nethealth/dbsave.tdb/aah_b48'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Tue Nov 27 12:29:10 2001) ). (cdb/DuTable::saveTable) Customer does not use liveHealth at all. Can we just re-create said table??? 12/13/2001 12:02:20 PM wburke We destroyed and recreated the dB as the customer wanted to get running asap and was unconcerned with loss of DATA. 
Closing 12/11/2001 9:42:20 AM wburke Error: Sql Error occured during operation (E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Mon Dec 10 16:45:23 2001) Ran through the Normal Steps 1. nhiIndexDiag - returned multiple dups 2. cleanStats_mod - seemed to run successfully. - no concordClean.out created. 3. Manual rollup failed. - same error 4. nhiIndexDiag results in multiple dups. The customer's Statistics Rollup is failing due to duplicate indexes. Running Nethealth 4.8 P6 D5 on a Solaris 2.7 OS There are 27443 duplicate indexes in the indexdiag.out file. Path: Had the customer run the cleanStatsmod.sh script. Did a manual rollup and the rollup still fails due to duplicates. Collected StatisticsRollup.log, StatisticsIndex.log, tables.out and indexdiag.out BAFS/57115 12/11/2001 4:37:14 PM yzhang Robin, This duplicate error is from db rollup. There is another one (19689) that is very similar to this. Yes, we need to consider a permanent fix. Here is my plan for these two: 1) have Walter and Russell check the following; with these we will know whether the stats0 table actually contains duplicate data (i.e., the stats poller inserts duplicate data into the stats0 table), or db rollup is trying to insert duplicate data into the stats1 table. No matter which is true, we need to obtain the database. We saw the latter case some time ago (problem 11800), which was fixed by Dave Shepard. Walter, Can you check the following: 1) how big their db size is. please make sure there is enough disk space available. previously we had a customer who got the same problem, and increasing disk space did actually solve the problem 2) you said you have run nhCleanStats_mod, but there is a table called nh_stats0_992667599 that still contains duplicates. 
Did you run the script with the clean argument? 3) get the following: echo "select sample_time, element_id, count (*) from nh_stats0_992667599 group by sample_time, element_id having count (*) > 1\g" | sql nethealth >> stats0_dup.out1 echo "select element_id, sample_time, count (*) from nh_stats0_992667599 group by element_id, sample_time having count (*) > 1\g" | sql nethealth >> stats0_dup.out2 Thanks Yulun 12/12/2001 6:35:01 AM wburke Requested. 12/12/2001 2:36:28 PM yzhang There is no duplicate in the existing stats0 table; the duplicate problem is caused by db rollup. The only thing we can do in this case is to obtain the customer's db and do the rollup test in house, but be sure to load the customer database onto the same platform, same Nethealth version and same environment. Yulun 12/12/2001 2:44:02 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, December 12, 2001 2:33 PM To: 'bcraun@itcdeltacom.com' Subject: Ticket # 57115 Bill, Engineering has requested your database: There is no duplicate in the existing stats0 table; the duplicate problem is caused by db rollup. The only thing we can do in this case is to obtain the customer's db and do the rollup test in house, but be sure to load the database onto the same platform, same Nethealth version and same environment. Please tar a database save, usually in $NH_HOME/db/save, and ftp it up this way: ftp.concord.com login: anonymous pass: ident cd incoming bin put Send email when the transfer is complete. 12/16/2001 12:48:42 PM wburke Loaded customer DB on KEG. Problem reproduced. 12/31/2001 4:00:02 PM rrick Added 57832 - Critical. 
12/31/2001 4:04:12 PM rrick -----Original Message----- From: Rick, Russell Sent: Monday, December 31, 2001 3:52 PM To: 'gallop@sendai.jpl.nasa.gov'; 'dgallop@pacbell.net' Cc: Hartman, Chuck; Feagans, Kelly; Wickham, Mark; Zhang, Yulun; Burke, Walter; Poblete, Jose Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #57832 Hi Don, Currently, I have identified the following: Initial Problem: Scheduled and command-line statistical rollups failing with ... sqlErrorText: E_US1592 INDEX: table could not be indexed because rows contain duplicate keys. (Thu Dec 27 16:37:37 2001) Explanation: Stats0 tables get rolled up into stats1 tables. The way this is done is the following: Your stats0 table has good data and a good table skeleton. The calculations are committed against these stats0 tables to create a temp file (statsRollup.dat) that is stored in the $NH_HOME/tmp directory. At that point the system creates a stats1 table. It then loads the temp file's data into the stats1 table. When it tries to index that stats1 table, it fails because it finds one or more duplicate keys. A table cannot be indexed if duplicate keys exist inside it. The big question is how this table was populated with these duplicate keys. Other Information: When this problem occurs, it seems to be happening later into the rollup, so that about 500 MB of the 900 MB of data that your shop collects each day is rolled up. You mentioned that you keep 4 weeks of raw statistical data. You also mentioned that you have 35 GB of free disk space in your db. At this point, I have definitely determined that this is a bug. We have submitted Problem Ticket #19708 to address this issue. My Engineering colleagues will be returning from the holiday vacation on Wednesday, January 3rd, 2002 to address this issue. Thanks very much for all your patience, and for providing me with all the pertinent data to help resolve this issue quickly. 
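Russ's explanation above can be sketched in code. The table and column names (stats1, sample_time, element_id) follow the ticket text; the row shape is a hypothetical stand-in for the real eHealth schema. The point is simply that a unique index build must fail as soon as two rows share a key, which is the E_US1592 symptom:

```python
# Minimal sketch of the stats1 unique-index build described above.
# Row shape and names are illustrative, not the real eHealth schema.
def build_unique_index(rows):
    """Index rows by (sample_time, element_id); fail on any duplicate key,
    mirroring the Ingres E_US1592 'rows contain duplicate keys' error."""
    index = {}
    duplicates = []
    for row in rows:
        key = (row["sample_time"], row["element_id"])
        if key in index:
            duplicates.append(key)
        else:
            index[key] = row
    if duplicates:
        raise ValueError(f"table could not be indexed; duplicate keys: {duplicates}")
    return index

# Two rows with the same key, as in the customer's failing rollup.
stats1_rows = [
    {"sample_time": 1005821741, "element_id": 1036064, "value": 1.0},
    {"sample_time": 1005821741, "element_id": 1036064, "value": 1.0},
]
```

Calling build_unique_index(stats1_rows) raises; on a list without repeated keys it returns the index.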
Have a Happy New Year, Russ Rick Senior Technical Support 1/2/2002 2:25:41 PM rtrei Dave-- This ticket is about to be escalated. In it Yulun says to redirect it to you after the initial investigation. Have you been kept in the loop regarding these rollup failures? Do you need anyone from the db team to do some analysis for you? Let me know asap. 1/2/2002 3:09:53 PM dshepard Robin - Yulun told me this might be coming, but that was it. Last I talked to him I requested that someone from the Db team point me to the duplicates in the Db. I don't have the knowledge or tools to find them by myself. I am hoping once I see some examples of the duplicate records I will be able to pursue it myself. 1/3/2002 2:29:44 PM rrick -----Original Message----- From: Kelly Feagans [mailto:kfeagans@concord.com] Sent: Thursday, January 03, 2002 2:08 PM To: Russell Rick Subject: JPL ticket 57832? Importance: High Russ: Any news? Has JPL been contacted today? kf -- Kelly Feagans Concord Communications 818-249-6390 818-249-6526 FAX -----Original Message----- From: Rick, Russell Sent: Thursday, January 03, 2002 2:17 PM To: Trei, Robin; Shepard, Dave Cc: Feagans, Kelly Subject: FW: JPL ticket 57832? Importance: High Hi Folks, Do we have any info on this issue yet? I need to give the customer some type of update today. Sorry to be pushing at you. - Russ 1/7/2002 12:39:05 PM rrick -----Original Message----- From: Donald L Gallop [mailto:gallop@sendai.jpl.nasa.gov] Sent: Monday, January 07, 2002 10:28 AM To: RRick@concord.com Subject: RE: nhDbRollup Russ, What is the latest on this? Nethealth is still working but the stats db is up to 34.3G. don 1/7/2002 4:26:15 PM dshepard Changing this to WIP while I wait for info from the database group. 1/8/2002 5:18:20 PM yzhang Dave, Here is my investigation regarding the duplicate. The example duplicate: element type is: empire_unix_processSet.mtf (mtf_name), element_id: 1036064. 
Now I hand this one to you. Don, I will write a script to keep the customer going prior to Dave's fix. For your reference, here is the procedure I used to find the duplicate: 1) run stats rollup in debug mode to find the stats1 table: it is nh_stats1_1005886799 2) manually create this stats1 table with a non-unique index 3) run stats rollup again 4) run the following query to find out the duplicated element_id: select sample_time, element_id, count (*) from nh_stats1_1005886799 group by sample_time, element_id having count (*) > 1\g Executing . . . sample_time=1005821741, element_id=1036064, count=2 (1 row) continue select element_id, sample_time, count (*) from nh_stats1_1005886799 group by element_id, sample_time having count (*) > 1\g Executing . . . element_id=1036064, sample_time=1005821741, count=2 (1 row) 5) run the following query to find out the element mtf_name: select mtf_name from nh_element where element_id = 1036064\g Executing . . . mtf_name=empire-unix-processSet.mtf (1 row) 1/9/2002 7:12:02 PM dshepard I was unable to spend time on this today, and am out until Monday. This is going to be a tough ticket for someone else to take. It requires intimate knowledge of how the poller interacts with the database and how it does its calculations and data integrity checks. I'll let the Escalation team decide if it is worth assigning to Santosh while I am gone. 
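Yulun's step 4 above is a GROUP BY ... HAVING count(*) > 1 query. The same duplicate-key check can be sketched outside the database; this assumes rows are dicts carrying the two key columns named in the ticket, which is an illustration rather than the real schema:

```python
from collections import Counter

def find_duplicate_keys(rows):
    """Python equivalent of the query in step 4:
    select sample_time, element_id, count(*) from <stats1 table>
    group by sample_time, element_id having count(*) > 1"""
    counts = Counter((r["sample_time"], r["element_id"]) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Sample rows shaped like the customer's data: one repeated key, one unique.
sample_rows = [
    {"sample_time": 1005821741, "element_id": 1036064},
    {"sample_time": 1005821741, "element_id": 1036064},
    {"sample_time": 1005821999, "element_id": 42},
]
```

find_duplicate_keys(sample_rows) reports only the repeated (sample_time, element_id) pair with its count, matching the "(1 row)" result Yulun saw.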
1/10/2002 4:57:27 PM rrick -----Original Message----- From: Donald L Gallop [mailto:gallop@sendai.jpl.nasa.gov] Sent: Thursday, January 10, 2002 4:38 PM To: RRick@concord.com Subject: RE: Concord Communications Nethealth Support Reply..........RE: Call Ticket #57832 & Problem Ticket #19708 Russ, The shell gave me the prompt back but there is no file Cleanstats0.out. Is the script still running? How will I know when the script is done? When the Cleanstats0.out file is created? don -----Original Message----- From: Rick, Russell Sent: Thursday, January 10, 2002 4:44 PM To: Zhang, Yulun Subject: FW: Concord Communications Nethealth Support Reply..........RE: Call Ticket #57832 & Problem Ticket #19708 It looks like the script finished, but did not produce any output. What do you think? - Russ 1/14/2002 1:19:32 PM jpoblete I'm adding Call Ticket 56941 to this problem ticket. I have the customer's DB and I have reproduced the problem here. I have debugger output from the rollup showing the error; it is located in the call ticket directory\Jan09 1/14/2002 4:03:59 PM jpoblete -----Original Message----- From: Poblete, Jose Sent: Monday, January 14, 2002 3:41 PM To: Zhang, Yulun Cc: Gray, Don; Piergallini, Anthony; Trei, Robin Subject: RE: 19689 - Rollup Duplicated keys ... Yulun, Here is what you asked: select name from nh_element where element_id = '1036064'\g Executing . . . name=nis0-la2.snfc21-1691-SH-NIS (1 row) continue I looked at the other steps, but it seems you already completed them. Is this all that you need? -JMP 1/15/2002 4:00:31 PM dshepard When I agreed to accept this ticket from the Db group, I did not realize the problem was duplicates in the stats1 tables. The poller writes the stats0 tables and is responsible for duplicates in those. It does not read, write, or even know about the stats1 tables. Those are written by rollups and have nothing to do with my group. 
If there were duplicates in the stats0 tables, the problem would have shown up with rollups failing and index runs failing. Since that didn't happen, that suggests to me that there are no problems in the stats0 tables. Furthermore, once the stats1 tables are created, the stats0 tables are deleted. So there is nothing for me to even go look at. From what I can see, there is no similarity between this issue and past issues with stats0 tables such as ticket 11800. I see two possibilities: 1) The rollups are generating duplicate data due to a bug and inserting it into the stats1 tables. 2) The rollups are failing to identify duplicates in the stats0 tables and adding them to the stats1 tables as well. Either way there is a problem that needs to be resolved in rollups first. I am therefore transferring this back to the Db group. 1/16/2002 9:37:43 AM yzhang Robin, Just to let you know that this one came back to me because Dave Shepard thinks the duplicate generated from stats rollup is in the stats1 table, not in the stats0 table, so he thinks this is not a poller issue. 1/16/2002 8:14:34 PM yzhang Dave's change for prob. 11800 was merged into the patch that this customer is using. I think that this problem is very similar to 11800: both have duplicates from stats rollup on the stats1 table due to the processSet element. More investigation will continue. 1/22/2002 10:17:39 AM jpoblete Yulun, Sorry to bug you, any news on this one? 1/23/2002 5:22:45 PM yzhang Provided a workaround with a script; now I am debugging the problem for a permanent fix. 1/25/2002 5:25:19 PM jpoblete The workaround did OK -----Original Message----- From: HEADSPETH, KATHY (SBIS) [mailto:kh4213@sbc.com] Sent: Friday, January 25, 2002 10:53 AM To: Poblete, Jose Subject: RE: Concord Call Ticket 56941 Hi Jose, Good news!! My rollup took until 12:20 this morning to complete but, it completed and my database is down from 12 gig to 1.1 gig. Yeah!!!! 
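Dave's possibility 2 (the rollup failing to screen out duplicates before inserting into stats1) would be avoided by deduplicating on the key first. A minimal sketch, using the same hypothetical (sample_time, element_id) row shape as the ticket; which colliding row to keep is a policy choice the real fix would have to make, and keeping the first is only one option:

```python
def dedupe_for_rollup(rows):
    """Keep only the first row seen for each (sample_time, element_id) key,
    so a later unique-index build cannot hit duplicate keys."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["sample_time"], row["element_id"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Three rows, two of which collide on the key.
rollup_input = [
    {"sample_time": 1005821741, "element_id": 1036064, "value": 1.0},
    {"sample_time": 1005821741, "element_id": 1036064, "value": 2.0},
    {"sample_time": 1005821999, "element_id": 42, "value": 3.0},
]
```

After dedupe_for_rollup, every key appears exactly once, so the stats1 insert and index build would succeed.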
Thank you for your help and I will put the file you asked for in the incoming subdirectory on your ftp server. Thanks again! Kathy 2/15/2002 2:29:42 PM jpoblete Closing call ticket 56941. 3/4/2002 10:04:40 AM jnormandin Yulun, When will the fix be available for this issue? I also have an associated call ticket, 57832. Can I get a workaround for this issue like 56941? Thanks Jason 5/2/2002 6:55:31 PM yzhang The script cleaned the duplicates; more investigation will be done under problem 22116. 12/11/2001 1:53:54 PM jnormandin Data analysis job fails after approx. 20 minutes (same time frame each time). The data_analysis_xxxxxxx.log file does not list any errors, nor does it state Begin processing or End processing. After the process dies, a core file is produced. I had the customer run the da job utilizing the following debug flags: -Dm cu:cba:sre:sde:cdb -Dfall. I consulted with Jay Wolf regarding the core file and the debug output. Jay's thoughts were as follows: It is failing on the setup of the files, and they had restructured the code for HP. The way it was working (and this customer still has it) is: For all service profiles: open 5 files, write to all open files, close all open files. Now the change is a loop: For each service profile: open 5 files, write to 5 files, close 5 files. All related files can be found on Bafs for ticket # 56126 12/11/2001 3:06:23 PM yzhang Jason, This looks similar to 15181 (that is on HP). Can you get the following: 1) amount of memory and disk space 2) echo "select count (*) from nh_rpt_config\g" | sql nethealth > config.out 3) nhSchedule -list > list.out 4) configuration parameters and values. Robin, here is the advanced debug; it is the open of a file in the tmp directory that failed. 12/13/2001 9:37:15 AM yzhang Jason, Thanks for getting the information so quickly. Can you also get the /etc/system file from the customer. Yulun 12/14/2001 3:37:38 PM jnormandin On bafs. 
12/18/2001 10:49:59 AM yzhang Jason, Can you place nhiDataAnalysis from ~yzhang/remedy/19729 on ftp outgoing, and have the customer run it with the following command after increasing the max shared memory as described: nhiDataAnalysis -Dmall -Df tTzZdp -now 08/01/2001 >& da_debug_new.out Check with the customer regarding the -now option. 12/18/2001 5:43:09 PM yzhang Update to field test. 1/9/2002 10:20:14 AM jnormandin - After applying the new nhiDataAnalysis, the result is still the same. A core file has been produced. - All log files have been saved to Bafs for ticket 56126 in the 1.07.02 folder 1/16/2002 3:36:50 PM jnormandin Any update? 1/24/2002 3:53:41 PM yzhang Can you find out where the core file was originally generated? And please also obtain the database. 1/28/2002 3:07:12 PM cpaschal From: Paschal, Christine Sent: Monday, January 28, 2002 2:58 PM To: Zhang, Yulun Subject: 19729 (56126) - Data Analysis silently failing and generating core file Importance: High Hi Yulun, The customer's database save has been saved to: \bafs\escalated tickets\56000\56126\1.28.02\Intria_nethealth.tdb.tar In the problem ticket history, you stated you also need to know where the core file is originally generated from. Do you mean the directory where the core file was found? If not, can the information you are looking for be found in the core file itself? Thanks, Chris 1/28/2002 3:15:18 PM yzhang Yes, I want to know the directory where the core file was found. Can you load the customer's db in the same environment as the customer has (such as platform, eHealth version, patch level, configuration settings), then run dataAnalysis to see if you can reproduce the problem. (Make sure to check the environment with the customer before you load the db.) Thanks Yulun 1/28/2002 3:15:47 PM yzhang Yes, I want to know the directory where the core file was found. 
1/28/2002 3:30:27 PM cpaschal I've forwarded your question on to the reseller. I'll remove 5.0 from my Sol 2.8 box and install 4.8 p3/d3, then load the db and test the problem. I'll let you know my results from running data analysis as soon as I possibly can. Thanks, Chris 1/29/2002 2:35:41 PM cpaschal NOTES: Jason thinks the core was found in $NH_HOME. Unexpected EOF when trying to untar the db save; requested it be sent again. 2/7/2002 1:51:35 PM jnormandin Call ticket has been closed. - Customer rebuilt the eHealth system and the problem is gone. - Closing problem ticket 12/11/2001 2:04:35 PM rkeville Nethealth database unable to recover from deadlock on iirelation. - Solaris 2.7 - Network Health 4.8 P03/D05 - Database went inconsistent following a deadlock on iirelation; customer lost 22 hours of data. - The Network Health system still appeared to be polling during this time, however it was not. - E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 1615 for table iirelation in database nethealth with mode 5. Resource held by session [20685 a6f2]. 12/12/2001 1:34:50 PM yzhang The database is getting inconsistent in such a short time interval. You might want to run sysmod on nethealth and iidbdb, and verifydb on the system catalogs, which might tell you why the deadlock occurs. With this information, you might want to create an issue with CA. The other minor problems are: the dmt_show error, and the warning message from dataAnalysis. Yulun 12/12/2001 2:29:28 PM yzhang Bob, I am betting their database system catalog is corrupt again; sysmod nethealth and sysmod iidbdb will tell us if this is the case. 
verifydb -report -sdbname nethealth -odbms_catalogs -u$ingres will tell us what the problem actually is. When did you have the customer recycle the database? You can use the existing CA issue to work on this one. Thanks Yulun 1/11/2002 2:30:24 PM yzhang Sent two new executables, which are a fix for the deadlock caused by a stack dump. Waiting to see if these help. 2/11/2002 3:06:32 PM dbrooks Closed. Will reopen if the customer requests. 12/11/2001 2:26:01 PM beta program PWC - PricewaterhouseCoopers: Don Mount donald.d.mount@us.pwcglobal.com 813-348-7252 INSTALL.NH script failure Beta2 Solaris Upgrade eHealth 5.5 beta1 to beta2. I was getting an error message when trying to run the INSTALL.NH script for the beta2 install. INSTALL log error message: Can't write to the directory '/opt/concord': > Can't set ownership of files in /opt/concord. The problem was resolved by adding an ingres user to the system, although I am running Oracle and have an oracle user configured. 12/11/2001 2:29:06 PM beta program Beta Call for PWCGLOBAL - Installation issue for 5.5 Beta 2 to Lee Lopilato -----Original Message----- From: Lopilato, Lee Sent: Tuesday, December 11, 2001 1:39 PM To: Tisdale, Lorene; Fintonis, Melissa; Biando, Monique Cc: Keville, Bob; BetaGroup Subject: Beta Call for PWCGLOBAL - Installation issue for 5.5 Beta 2 Folks; This issue was a strange one and the customer is writing a bug and will forward it in. ISSUE: eHealth install fails while installing 5.5 beta2. Error message indicates no ingres user. SOLUTION: The customer created a user account for ingres and was then able to install. The customer is now completely installed. 12/17/2001 12:29:15 PM sorr Fixed. I found a couple of places that su'd to ingres when it should have checked if it was an Oracle kit. 
2/26/2002 10:41:49 AM Betaprogram Customer unable to verify - DID NOT TEST SINCE INGRES WAS ADDED TO SYSTEM BEFORE B4 UPGRADE 12/12/2001 10:49:15 AM beta program EQUANT: The elements are being added, but only one group is being created (cfcnet) with one device in it (bsea038-RH). LOG ATTACHMENT files: DCI File and System Log found in memo with Subject "Problem ticket # FW: eh 55 b2 Solaris DCI issue" in Outlook public folder Engineering -> Beta Test -> 5.5 -> Beta Sites Active -> Equant. 12/12/2001 1:23:22 PM wzingher This may be a duplicate of the discover problem (19444) that Lingli is working on. Assigning to Lingli, and I'll follow up if it's a different problem. 12/12/2001 4:24:28 PM lding In the DbsMsgHandler, there is hack code (a break statement) that blocks the while loop from handling multiple groups. Looks like a replication problem. 12/14/2001 7:37:57 PM tctang Setting to fixed. 3/11/2002 2:20:20 PM Betaprogram Email from Mike Loewenthal, BETA AE: Bug 19759, 20306: I believe this bug has to do with the DCI Import that Tom Manes was doing; he would be a better person to test this, as this was part of a ProServe customization that they did for Equant. *Forwarded mail to Tom Manes for verification purposes based on note above. - 3/11/02* 3/11/2002 2:34:43 PM Betaprogram Nital, Please verify these bugs are fixed for Equant. Melissa, Nital Gandhi is the new CCRD contact for Equant. Please send all beta correspondence to her. Thanks, Donna 12/12/2001 11:16:50 AM cestep The customer is trying to do a new installation of Nethealth 4.8 on HPUX 11.0. They have 4 GB of space on the partition that they are installing Ingres on. During the installation of Ingres patch #6793, the install fails. Looking at $II_SYSTEM/ingres/files/install.log, we see the following: ----------------------------------------------------------------------------------------------------------------- Starting DBMS server (default)...II_DBMS_SERVER = 52031 Initializing the Ingres system catalogs... 
Creating database 'iidbdb' . . . Setting up the Ingres Intelligent DBMS... The configuration file is locked by another application (config.lck exists). application: Ingres Intelligent DBMS setup host: ovopap02 user: ingres time: Tue Dec 11 14:14:12 2001 Installing Terminal Monitor utility files... ------------------------------------------------------------------------------------------------------------------ I also found the following in install_p6793.log: ---------------------------------------------------------------------------------------------------------- Initial validation of patch and installation env. complete. Starting the install program at: 11-Dec-2001 09:13:27 Install log: /export/ingres/ingres/files/install_p6793.log User command: ./utility/iiinstaller -i -s Executed from: /export/ingres/ingres/patch6793 Warning: The "install only" flag was supplied. The install program will not perform a backup of the installation. Computing disk space requirements ... Installation partition has insufficient disk space. Dir: /export/ingres/ingres Free space info (in kbytes): Available: 0 Amount needed: 33763 [B]ypass disk space check, [Q]uit install program: q Quitting install program as per user request. Ending at: 11-Dec-2001 09:14:08 ----------------------------------------------------------------------------------------------------- It appears that Ingres is calculating the disk space incorrectly. I cannot find where in the INSTALL.NH script it gives the quit command for iiinstaller. If we could get an install script that will bypass this issue, then the install should complete successfully. We are sure that the customer has enough space - around 4 GB. 12/12/2001 2:04:33 PM yzhang Can you get everything in the $NH_HOME/tmp directory, especially the Ingres-related temp output. Have the customer do an ingstart and see if the same error message comes out. 
The problem is that the space could not be allocated during ingstart even though there is enough disk space on the system. Get this as soon as possible; I will consider creating an issue with CA based on the information you provide. Thanks Yulun 12/12/2001 2:04:55 PM yzhang Change to more info. 12/14/2001 12:02:55 PM cestep Sent the modified install script, and the install completed successfully. We can close this. 12/18/2001 9:25:17 AM yzhang Problem solved. 12/13/2001 9:16:49 AM beta program Alcatel: Reinhard Pfaffinger Rein.Pfaffinger@alcatel.com > Senior Network Engineer > ARIS Network Services > Alcatel USA, Inc. > Phone: (972) 519-4943 > FAX: (972) 477-1210 > http://www.usa.alcatel.com/ > -----Original Message----- > From: Reinhard Pfaffinger [mailto:Rein.Pfaffinger@alcatel.com] > Sent: Wednesday, December 12, 2001 4:43 PM > To: betaprogram@concord.com > Subject: beta 2 installation failure > > Hi, > > The beta 2 installation failed. The Oracle install went clean, but > apparently there is not enough room on the /apps disk for a small > installation: > > df -k > Filesystem kbytes used avail capacity Mounted on > /dev/dsk/c0t0d0s0 6694053 4585147 2041966 70% / > /proc 0 0 0 0% /proc > fd 0 0 0 0% /dev/fd > mnttab 0 0 0 0% /etc/mnttab > swap 2648512 8 2648504 1% /var/run > swap 2648904 400 2648504 1% /tmp > /dev/dsk/c0t1d0s7 8759116 1474843 7196682 18% /apps > trendsvr:/apps/share/ftp > 26109793 1398446 24450250 6% /apps/ftp > > I can make room on the / disk and try to get the Oracle install on there and > use the whole /apps partition for the DB and eHealth. 
Here is the install > log: > > start installation at Wed Dec 12 11:27:58 CST 2001 > > Before installing eHealth, you should make sure that: > > 1) The account from which you will run eHealth exists > > You will need to supply the following information: > > 1) The directory eHealth will be installed in > 2) The name of the user that will run eHealth > > ---------------------------------------------------------------------------- > -- > eHealth Location > --------------------------------------- > Where should eHealth be installed? > '/apps/neth' doesn't exist. > Do you want it created for you (y|n)? [y] > grep: can't open /etc/nh.install.cfg > ---------------------------------------------------------------------------- > -- > eHealth User > --------------------------------------- > From which account will you run eHealth? > ---------------------------------------------------------------------------- > -- > eHealth Date format > --------------------------------------- > eHealth can display dates in one of the following formats. > > 1) mm/dd/yyyy > 2) dd/mm/yyyy > 3) yyyy/mm/dd > 4) yyyy/dd/mm > > What date format should eHealth use? (1|2|3|4) [1] > ---------------------------------------------------------------------------- > -- > eHealth Time format > --------------------------------------- > eHealth can display times in one of the following formats. > > 1) 12 Hour clock > 2) 24 Hour clock > > What time format should eHealth use? (1|2) [1] > ---------------------------------------------------------------------------- > -- > Web Reporting Module > --------------------------------------- > An HTTPD Web server will be installed. > Do you want this Web server to start automatically? [y] > What port should the Web server use? 
[80] > ---------------------------------------------------------------------------- > -- > Oracle Database Table Setup > --------------------------------------- > > You will now be given the option of whether you want your Oracle > database > to be created and to have its initial load. > > Do you want the creation of the oracle database to occur? (y|n)? [y] > -------------------------------------------- > --------------------------------- > Distributed Console > --------------------------------------- > Distributed consoles are used only in an eHealth clustered environment. > Distributed consoles do not poll and cannot discover elements. > For more information, refer to the eHealth Installation Guide. > > Do you want to install this system as a distributed console? [n] Please > select whether you want to install using > the small, medium, > or large model. This choice will determine the set of sizes used to > create your tablespaces and tables. > 1) small > 2) Medium > 3) LARGE > Please enter the number of your selection : 1 > ---------------------------------------------------------------------------- > - > Database Directories > --------------------------------------- > Oracle databases require the creation of a number of tablespaces > distributed over several disks. In order to create the database, the > install program needs to know which directories to create these > tablespaces in. eHealth supports between 1 and 9 directories for > tablespaces. Each directory must be on a different device. > > Enter number of directories to use for tablespaces : Enter directory 1 : > Error: There is not enough room in the sum of the disks for the database. > > Cleaning up... ------------------------------------------------------------------------------ "Fintonis, Melissa" wrote: > > what OS and Version are you running? ----------------------------------------------------------------------------- Solaris 2.8, Oracle 8.1.7, eH 5.5 beta2. The server only has two 9.1GB drives. 
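The install's final error above ("not enough room in the sum of the disks") implies a check that sums free space across the chosen tablespace directories and compares it with what the selected size model needs. A hedged sketch of that kind of check; the function name and the needed-space parameter are assumptions for illustration, not values from the real installer:

```python
import shutil

def enough_room(tablespace_dirs, needed_kb):
    """Sum free space (in KB) across the candidate tablespace directories
    and compare it with what the chosen size model needs."""
    free_kb = sum(shutil.disk_usage(d).free // 1024 for d in tablespace_dirs)
    return free_kb >= needed_kb
```

With two 9.1 GB drives largely consumed by the OS and Oracle, a small-model requirement can still exceed the summed free space, which is the situation Alcatel hit.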
Thanks! 12/13/2001 9:43:08 AM beta program Mike is onsite: From: Loewenthal, Michael Sent: Thursday, December 13, 2001 9:32 AM To: 'Rein.Pfaffinger@alcatel.com' Cc: Betaprogram Subject: RE: beta 2 installation failure Hi, I have installed 5.5 Beta on NT at one of our sites using two 9GB drives. I had to use one whole 9GB drive for the Oracle tablespaces, and the other 9GB drive held the OS, Oracle and eHealth. Today I am going to install 5.5 Beta on a SUN machine with two 9GB drives. I will let you know how that install goes and how I got it to work. Thanks. Michael Loewenthal BETA Application Engineer Concord Communications, Inc. 600 Nickerson Rd. Marlboro MA 01752 E-Mail: MLoewenthal@concord.com Office: +1-508-486-4598 Mobile: +1-203-820-6940 Fax: +1-508-486-4599 12/14/2001 10:23:24 AM wzingher The error message for not enough disk space was displayed correctly. This is not a bug. 12/13/2001 9:19:46 AM beta program CCRD AE for CSC: (Mike Loewenthal Submitted) Victor Wiebe vwiebe@csc.com 302-391-8862 While creating the DB, it is looking for the directory "%NH_HOME%/tmp", which wasn't created in the install, to write some temp files to, rather than using the TMPDIR or TMP system variable. -Mike, the Beta AE- 12/13/2001 11:02:13 AM rlindberg I believe this is fixed in B3. If so, we may need to send out an alert to B2 NT customers. 12/13/2001 12:57:19 PM tfang Here is the email sent to Mike: Hi Mike, During the B2 NT installation, $NH_HOME\tmp may not be created. (It is fixed in B3.) Please manually create the "tmp" directory under $NH_HOME (i.e. c:\eHealth\tmp) and set the environmental variable TMPDIR to "%NH_HOME%/tmp" before running nhCreateDb. Please let me know if you have any questions. Thanks! 
-Tracy 12/14/2001 9:33:45 AM tfang Refer to #19630 1/11/2002 8:00:03 AM mfintonis Marking repeat. 2/15/2002 2:29:53 PM mfintonis Status - Repeat of #19630, which was fixed in Beta 3 - 5.5 12/14/2001 12:16:04 PM bmiller Contact: Drew Golden Phone: 972-781-1747 Email: sgolden@concord.com The customer is running eHealth 5.0.2 on NT Server 2000. The machine has 512 MB RAM, and they are polling 1000 statistics elements. In looking at the customer's system log, there are many notations of updates to the poller configuration, but after some of these, we see the error "nhiPoller[Net] Pgm nhiPoller[Net]: Sql Error occurred during operation (E_RD0045 Failed to gain access to a ULH object due to ULH error". After this comes the following error: "nhiPoller[Net] Pgm nhiPoller[Net]: Assertion for '0' failed, exiting (in file ../PollTimer.C, line 1404)." In each case, the console crashes after these errors. The pollerAudit logs related to the updates of the poller configuration prior to the crashes all reflect an attempt to modify a Response Source element. The system logs, event logs, pollerAudit logs, and other collected data are on: BAFS/escalated tickets/57000/57556 12/14/2001 4:06:11 PM yzhang Created an issue with CA regarding the ULH error. 1/11/2002 10:40:07 AM yzhang Let's close it if there is no way for us to obtain the errlog file. 12/14/2001 12:44:16 PM beta program Vic Wiebe Computer Sciences Corp. 302-391-8862 vwiebe@csc.com Lorene: This is the error I receive on nmcde82: nhiDbStatus.EXE: Internal Error: Expectation for 'sqlca.sqlcode == 0' failed (disconnecting from the database in file ./duDatabaseSql.C, line 489). (cu/cuAssert) This server also has Sybase installed on it for a production application - could this be interfering? Vic 12/19/2001 11:28:54 AM wzingher Changes made by Ravi during B3 should fix this. Marking fixed. 
12/17/2001 2:06:00 PM beta program CCRD MIS: (Submitted by Mike Loewenthal, CCRD PM BETA AE) Alan Baker abaker@concord.com 508-486-4452 While installing the Concord MIS Beta box, the create DB from the install, as well as from the command line, is not allowing the creation of a database when specifying 2 directories. I have created a DB using 1 directory from the command line. -Mike Loewenthal, the Beta AE- 12/17/2001 2:50:37 PM rlindberg Re-assign to Steve to evaluate. 12/19/2001 8:11:54 AM rlindberg Steve looked at the code and would need more info. I would like MIS to try this again with B3 when we release it, since we fixed a bunch of problems with install. Marking MoreInfo and pushing to B4. 1/10/2002 11:37:59 AM mfintonis Rob, this was assigned to Steve Orr, who is now gone. Since you were the last person to touch it, I'm reassigning it to you. Thanks, Melissa 1/14/2002 3:58:44 PM beta program Sent email to MIS and the Beta AE; this needs retest in B3! 1/14/2002 5:05:30 PM rlindberg Internally, we have re-tested this and don't see a problem. Marking NoDupl. From Patricia Stinney: If this test is for the use of spanning the EHEALTH database across two different partitions/directories, but not physical drives, then yes, I have installed using this configuration on a WindowsNT system (ATTU) in the QA Lab with the oracle55_ndnt kit from last Thursday (1/10/2002). 12/17/2001 3:22:31 PM cestep The customer just did a new install of eHealth 5.0.2. He imported a database from eHealth 4.8, and on the console he sees "Server stopped unexpectedly", and it is unable to restart. I found problem ticket #19333, which seemed to have the same symptoms, and it pointed out that the Live Exceptions server was causing the failure. However, problem ticket #19333 was declined because the customer was upgrading from a Beta release of 5.0, and that was not supported. This configuration, 4.8 -> 5.0.2, should be supported. 
I had the customer disable the Live Exceptions server in $NH_HOME/sys/startup.cfg, and this allowed eHealth to come up and start polling. However, the customer needs to be able to use the Live Exceptions product. He sent me the output of the following:
echo "select * from nh_group_list;\g" | sql ehealth > grouplist.out
echo "select * from nh_group_list_members;\g" | sql ehealth > groupmembers.out
echo "select * from nh_subject;\g" | sql ehealth > subject.out
I also obtained advanced logging from the Live Exceptions server for when the failure occurs. All files are on BAFS, under ticket #57449.
12/17/2001 4:07:15 PM rtrei Colin-- Although this is a case of the nhiLiveExSvr dying, it isn't clear to me that it is exactly like the other ticket. That one had issues with corrupted data, and I'm not sure we are seeing that here. I do wonder why the nh_subject table has more data than the nh_group and nh_group_list tables. I am ccing Jay, as I am not sure which one of us will end up taking this, so please keep us both in the email for now. Could you please do the following: if it is easier for the customer to ftp the db he loaded, please get that immediately. Otherwise, please do the following:
echo "copy table nh_subject () into '$NH_HOME/tmp/nh_subject.dat'\g" | sql ehealth
echo "copy table nh_group () into '$NH_HOME/tmp/nh_group.dat'\g" | sql ehealth
Repeat the command for the following tables: nh_group_list nh_alarm_rule nh_alarm_attribute nh_alarm_threshold nh_alarm_event_threshold nh_exc_profile
echo "select subject_type, group_type, name, expire_time, count(*) from nh_subject group by subject_type, group_type, name, expire_time having count(*) > 1\g" | sql ehealth > dups.out
> -----Original Message-----
> From: Estep, Colin
> Sent: Monday, December 17, 2001 3:19 PM
> To: Trei, Robin
> Subject: Problem ticket #19887
>
> Hi Robin,
>
> I just logged problem ticket #19887. Please take a look at BAFS, ticket #57449/12.14.01 and 57449/12.17.01.
> The customer uses Live Exceptions quite a bit, so I didn't want to proceed with deleting any service profile ID's until you had looked at this information.
>
> Thanks,
>
> Colin Estep, Senior Support Engineer
> Concord Communications, Inc.
> http://www.concord.com
> 600 Nickerson Road, Marlboro, MA 01752
> Toll Free: 888-832-4340
> Fax: 508-303-4343
> Intl: 508-303-4300
> ****************************************************************
12/17/2001 4:25:33 PM cestep requested information indicated above.
12/18/2001 7:51:04 AM cestep I have obtained the requested data from the customer. It's on BAFS, under ticket 57449/12.18.01/InfoData.
12/19/2001 4:23:57 PM cestep -----Original Message----- From: Trei, Robin Sent: Wednesday, December 19, 2001 4:03 PM To: Pattabhi, Ravi; Bui, Ha; Estep, Colin Cc: Wolf, Jay; Venuto, Donna Subject: ticket # 19887
This ticket is non-escalated, but very serious. It deals with duplicates we are putting in the nh_subject table at the time of the database conversion (or db load). Ultimately, we will create a one-off and put a fix in the patch, but we do not have enough data yet. I still haven't gotten the database to confirm this, but looking over the customer's group tables, I found that the nh_subject table contained 1096 rows. Of these, 228 were duplicates. When I looked at the time stamp on the duplicates, I saw that 114 of them were created on Dec 13th, which is when the database conversion was done. Of the 228 duplicates, 136 were groups, the remainder being group_lists. I looked in the nh_group table and found 136 groups there, too, all with ids matching the duplicates in the subject table. I then looked in the nh_group_members table and found that-- of the duplicate ids-- only those created on Dec 13th were showing up in the group_members table!
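Robin's duplicate diagnosis above amounts to a GROUP BY / HAVING check on the candidate key (subject_type, group_type, name, expire_time). A minimal sketch of that check, using Python's sqlite3 as a stand-in for the Ingres ehealth database (table and column names are taken from the ticket; the row data is invented for illustration):

```python
import sqlite3

# Toy stand-in for the eHealth nh_subject table (columns from the ticket;
# the rows here are invented: id 2 is a conversion-time duplicate of id 1).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE nh_subject (
    subject_id INTEGER, subject_type INTEGER, group_type INTEGER,
    name TEXT, expire_time INTEGER, create_time TEXT)""")
rows = [
    (1, 5010, 1, "Routers", 0, "2001-11-01"),   # original group row
    (2, 5010, 1, "Routers", 0, "2001-12-13"),   # duplicate created at conversion
    (3, 5010, 2, "Servers", 0, "2001-11-01"),   # unique row
]
conn.executemany("INSERT INTO nh_subject VALUES (?,?,?,?,?,?)", rows)

# Same shape as the diagnostic query in the ticket: duplicates share the
# candidate key but have different subject_ids.
dups = conn.execute("""
    SELECT subject_type, group_type, name, expire_time, COUNT(*)
    FROM nh_subject
    GROUP BY subject_type, group_type, name, expire_time
    HAVING COUNT(*) > 1""").fetchall()
print(dups)
```

Running this flags only the two 'Routers' rows, which is exactly the signal the dups.out file was collected for.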
The only way I can currently explain this data is that when we were doing the conversion for groups, 68 times we found a row in the nh_subject table and inserted it into the nh_group table, and then failed to open the corresponding group file. Later, after we had processed the nh_subject table, we look for leftover .grp files. We presume they have no associated row in the nh_subject table, so we create a row in the nh_subject table and in the nh_group table (causing the duplicates)-- only this time we were able to open the files and process the data! I originally thought it might be case sensitivity, but it can't be, because the name column is identical in the duplicates. At one point I thought it might be file permission problems, but how could we fail to open the file the first time and succeed the second time? So I do not know what caused this to happen at present.
Below I have an explanation of what is going on in the script, and I've included a script which I believe will get the customer going. (Assuming that this is the customer's problem.) Colin is trying to get the database from the customer, but it hasn't come in yet. When it does, it should be loaded on my NT system (which is all set up for 5.0), and then test the script. Ha would be best placed to create a debuggable nhiLoadDb and step through it to see what happens. (Sorry, Ha.) If the script removes the duplicates, bring up the console and test that groups can be edited. If so, send the script to the customer to run, and Tech Support can keep it available in case this problem occurs again.
There is a possibility that the customer database will load with no duplicates on my system; if that is the case, it is most likely file permission problems (the customer had some issues in this area prior to this problem), and Colin can give the script to the customer and have him run it there. I have run the script with the customer's group tables replacing my own database tables, so I know it will remove the dups.
*************
Here are the steps I am going to propose for the script. This will need to be done for both groups and group_lists.
Determine what the duplicates are:
create table new_dups as
select name, group_type, subject_type, expire_time, count(*)
from nh_subject
group by name, group_type, subject_type, expire_time
having count(*) > 1\g
Then get the problem ids associated with the duplicates:
create table dup_ids as
select a.subject_id, a.name, a.group_type, a.subject_type, a.expire_time, a.create_time
from nh_subject a, new_dups d
where a.name = d.name and a.group_type = d.group_type and a.subject_type = d.subject_type and a.expire_time = d.expire_time and a.subject_type = 5010\g
Dup_ids will contain both the 'old' and the 'new' duplicates. Remember, the 'old' duplicates are the ones LE has been working with, which will have data associated with their ids. Yet it is the 'new' duplicates that have the group_member expansion that we now need. So we are going to re-id all the group_member rows containing the 'new' id with the 'old' id value, and then remove the 'new' id duplicates from the nh_group and nh_subject tables.
create table grp_deletes as
select a.subject_id as old_id, a.name, a.group_type, a.expire_time, a.subject_id as new_id
from dup_ids a
where a.subject_type = 5010 and a.subject_id not in (select distinct group_id from nh_group_members)\g
update grp_deletes from dup_ids
set new_id = dup_ids.subject_id
where grp_deletes.group_type = dup_ids.group_type and grp_deletes.name = dup_ids.name and grp_deletes.expire_time = dup_ids.expire_time and dup_ids.subject_type = 5010 and grp_deletes.old_id <> dup_ids.subject_id\g
update nh_group_members from grp_deletes
set group_id = old_id
where group_id = new_id\g
delete from nh_subject where subject_id in (select new_id from grp_deletes)\g
delete from nh_group where group_id in (select new_id from grp_deletes)\g
12/19/2001 4:25:02 PM cestep Will wait to send the script, until it can be tested here.
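The re-id-then-delete steps proposed above can be sketched end to end against a toy database. This is a minimal illustration in Python/sqlite3 standing in for Ingres; the ids and member rows are invented, and the old/new pairing that the real script derives from the dup_ids join is hard-coded here for clarity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nh_subject (subject_id INTEGER, name TEXT);
CREATE TABLE nh_group   (group_id INTEGER, name TEXT);
CREATE TABLE nh_group_members (group_id INTEGER, member_id INTEGER);
-- 'Routers' exists twice: id 1 is the 'old' row LE has been working with,
-- id 2 is the 'new' duplicate whose group_member expansion we want to keep.
INSERT INTO nh_subject VALUES (1, 'Routers'), (2, 'Routers');
INSERT INTO nh_group   VALUES (1, 'Routers'), (2, 'Routers');
INSERT INTO nh_group_members VALUES (2, 101), (2, 102);
""")

old_id, new_id = 1, 2   # in the real script this pairing comes from dup_ids

# Re-id the member rows from the 'new' id to the 'old' id ...
conn.execute("UPDATE nh_group_members SET group_id=? WHERE group_id=?",
             (old_id, new_id))
# ... then drop the 'new' duplicate from both tables.
conn.execute("DELETE FROM nh_subject WHERE subject_id=?", (new_id,))
conn.execute("DELETE FROM nh_group WHERE group_id=?", (new_id,))

members = conn.execute(
    "SELECT group_id, member_id FROM nh_group_members ORDER BY member_id"
).fetchall()
subjects = conn.execute("SELECT subject_id FROM nh_subject").fetchall()
print(members, subjects)
```

After the cleanup, the member expansion survives under the old id and each group name appears once, which is the invariant the Ingres script restores.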
12/20/2001 11:20:59 AM jnormandin - Associating call ticket 57857. Same issue, nhiLiveExSvr dying after 4.8 to 5.0.2 upgrade. I have disabled the nhiLiveExSvr to enable polling and have requested the db. Once I have this, I will run the diagnostics run by Colin to ensure that the issues are the same.
12/27/2001 10:09:42 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, December 27, 2001 9:58 AM To: Pattabhi, Ravi Cc: Wickham, Mark Subject: PT # 57449 - dups on nh_subject
Ravi, I understand there is a script to be tested against the customer's database in-house. We have loaded the db on pc-mwickham.concord.com 5.0.2 NT. Please advise as to the next step. error: Non-Fatal database error on object: NH_SUBJECT 26-Dec-2001 18:37:17 - Database error: -33000, E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. Thanks, Walter
12/28/2001 12:57:29 PM wburke Script is on BAFS/57449 - sent to customer, appeared to fix the nhiLiveEx problem.
1/2/2002 10:00:02 AM wburke -----Original Message----- From: Boake, John [mailto:John.Boake@jacobs.com] Sent: Wednesday, January 02, 2002 8:33 AM To: 'Burke, Walter' Subject: RE: Ticket # 57449 - Ingres Failure
Walter, Live Exceptions is now running, but I can't add any Subjects To Monitor. Every time it tries to apply the changes, it comes back saying that it can't connect to the server. It appears that there are things being monitored, but I have not added any, nor are there any in the Subjects to Monitor list.
1/2/2002 10:02:33 AM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, January 02, 2002 9:51 AM To: Trei, Robin Subject: PT # 19887 - liveExSrv
Hi Robin, Happy New Year! I had the customer run the script: << File: CleanUpGroups.sh >> However, two subsequent problems: 1) Another thing I have encountered is when I try to access or edit my Group lists, it causes the server to crash on a critical error. 2) Live Exceptions is now running, but I can't add any Subjects To Monitor.
Every time it tries to apply the changes, it comes back saying that it can't connect to the server. It appears that there are things being monitored, but I have not added any, nor are there any in the Subjects to Monitor list. ________________________________
1/2/2002 10:11:53 AM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, January 02, 2002 10:00 AM To: 'Boake, John' Subject: RE: Ticket # 57449 - Ingres Failure
John, Run: Console -> Setup -> Adv. Logging -> check Database, Console, Messaging. Reproduce the crash. Turn off adv. logging for all. Send $NH_HOME/log/advanced/nhiDbServer_dbg.txt, nhiConsole_dbg.txt, and nhiMsgServer_dbg.txt. Sincerely,
1/2/2002 12:39:23 PM wburke -----Original Message----- From: Boake, John [mailto:John.Boake@jacobs.com] Sent: Wednesday, January 02, 2002 12:27 PM To: 'Burke, Walter' Subject: RE: Ticket # 57449 - Ingres Failure
Walter, I rebooted the server to start. Then I tried to recreate the problem causing the server to crash; now it does not crash. However, when I start Live Status, there are no groups listed, and I do have authority under my ID. I can now edit the Group Lists without causing a server crash.
1/3/2002 9:28:23 AM rtrei Walter-- I am confused about this latest status; I thought I had heard from you that group editing would still crash the console. Did you do something more to make that go away? From this input, it sounds like the only problem the customer now has is that LiveStatus does not show his groups. Is that correct? (I have loaded his database to the point where the conversion started, and have determined how this problem occurred. Ultimately, it was due to the fact that somehow several of the customer's *.grp files had been deleted but the entries not removed from the subject table. I don't yet know if this was done manually by the customer, or was done by our code. I'm also checking to see if this occurs on unix as well as NT.
Lastly, I want to make sure that there are not additional database edits that we need to make because of these deleted groups, so I really want to understand where the customer is having problems.)
1/3/2002 8:08:07 PM rtrei most of the problems are fixed. The customer is still having problems with Live Status; need help from Jay or Rich to debug.
1/7/2002 11:11:09 AM dbrooks change to MoreInfo per Esc tkt meeting 1/7/02.
1/7/2002 12:24:28 PM wburke Created new ticket # for other issues. - Mostly a poller issue. Spoke with Shep on this, new ticket to be logged. # 58623
1/8/2002 9:35:11 AM rtrei sent an updated script out on Jan 2nd. At this point, I believe the customer is up and running, and the ultimate fix will be rolled into the next patch.
1/9/2002 1:42:35 PM wburke Customer is up and running OK. Fix script worked. Back to assigned for patch release.
1/10/2002 10:54:29 AM dbrooks move to field test per esc ticket meeting on 1/10.
1/11/2002 3:00:02 PM wburke -----Original Message----- From: Burke, Walter Sent: Friday, January 11, 2002 2:48 PM To: 'John.Boake@jacobs.com' Subject: Ticket # 57449 - nhiLiveExSrv issue.
John, Please confirm whether the following has cleared up since running the fix script: - LiveStatus does not show your groups or group lists. - LiveExceptions is reporting and sending information to HPOV; however, no elements appear in Subjects to Monitor, and alarms are generated against non-appearing elements. Sincerely,
1/14/2002 11:15:33 AM wburke why is this in field test? Only the script was sent and confirmed to have worked.
2/1/2002 5:20:44 PM hbui Modified sa_upd_r50.sc to handle the situation where, when putting info from group/grouplist files into the database (nh_group/nh_group_list tables), some of the files are missing but still have entries in the nh_subject table.
2/11/2002 2:47:46 PM rhawkes Ha has merged the fix in for 5.5.
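The sa_upd_r50.sc change hbui describes, handling group files that are missing while their nh_subject rows remain, boils down to reconciling table rows against files on disk before conversion. A hypothetical sketch of that reconciliation in Python (the .grp file layout and names here are invented for illustration, not the actual eHealth on-disk format):

```python
import os
import sqlite3
import tempfile

# Toy nh_subject: two group rows, but only one has its .grp file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nh_subject (subject_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO nh_subject VALUES (?,?)",
                 [(1, "Routers"), (2, "Servers")])

grp_dir = tempfile.mkdtemp()
open(os.path.join(grp_dir, "Routers.grp"), "w").close()  # only file present

# Rows whose .grp file is gone are exactly the ones the patched conversion
# must handle specially instead of re-inserting them as duplicates.
missing = [name for (name,) in
           conn.execute("SELECT name FROM nh_subject ORDER BY name")
           if not os.path.exists(os.path.join(grp_dir, name + ".grp"))]
print(missing)
```

Here 'Servers' is flagged: row present, file gone, which is the mismatch that originally produced the nh_subject duplicates.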
2/14/2002 4:07:07 PM hbui Posted one-off on nt, sol, and hp
2/22/2002 2:03:32 PM rtrei resetting to patch 3 because Ha missed the part where we needed to clean up any group members or group list members that didn't exist. (What she put into patch 2 fixed some stuff, but not all.)
3/15/2002 9:02:03 AM rsanginario Passed Tribunal. Marking this as fixed.
12/18/2003 4:34:38 PM rkeville Script modified - updated to nhsCleanGrpDupes.ksh to adhere to Support scripting standards.
=======================================
12/17/2001 3:41:51 PM cestep During install, it does not create a database named "nethealth". From the install log:
--------------------------------------------------------------------------------------------------------------------------
Since INGRES uses only lower-case database names, the database name provided 'nethealth' will be translated to 'UIAMUI'. If this is acceptable, enter 'Y' at the following prompt. Otherwise, enter 'N', and no database will be created. Continue?
Creating database 'UIAMUI' . . .
Creating DBMS System Catalogs . . .
Modifying DBMS System Catalogs . . .
Creating Standard Catalog Interface . . .
Creation of database 'UIAMUI' completed successfully.
Building the application objects in database 'UIAMUI' for user 'health' . . .
Creating the Tables . . .
Loading the Initial Data . . .
Creating the Table Structures and Indices . . .
Creating the Table Structures and Indices for sample tables . . .
Granting the Privileges . . .
Granting the Privileges on the sample tables . . .
Creation of database 'UIAMUI' for user 'health' completed successfully.
Updating database with user protocol information ...
Protocol information updated successfully ....
Failed to create nethealth for user health in ingres.
----------------------------------------------------------------------------------------------------------------------------------------
However, when we use "nhCreateDb nethealth", it completes successfully, and Network Health uses it as the default database. The other problem is that we cannot delete the database that was created. When the user cuts and pastes the name, it translates to 'qiuiiamui', and it cannot find any such database. The customer insists that this is the English version of Network Health; this CD was used in other installations without a problem. The only thing that is different on this one is a customized install script created by Yulun Zhang, to bypass an iiinstaller space requirement problem. iiinstaller does not compute the partition space correctly on HPUX 11.0, so we must bypass its warning to continue the install.
12/18/2001 9:22:24 AM yzhang Colin, The script (for hp11) you sent to the customer tries to take care of the install hang problem; it has nothing to do with checking disk space. The problem the customer currently has is very strange. Have the customer run the attached install script, which has debug enabled in the create-database section. Send the install.log, the output of env | grep NH_, and a tar file of the NH_HOME/tmp directory.
12/18/2001 9:35:15 AM cestep Sent a request for the information indicated above.
1/22/2002 3:03:28 PM yzhang As far as I know, everything is running fine on the customer site; the only problem is that there is a db (from the nethealth installation) called 'UIAMUI'. Can you check the following:
1) ls -l $NH_HOME/idb/ingres/data/default > db.out
2) sql iidbdb
3) select * from iidatabase
Place the output of 2 and 3 in a file.
1/23/2002 7:57:38 AM cestep Sent the request for this information to the customer.
2/7/2002 6:19:42 PM schapman The requested information is on Bafs\57000\57682\2.7.02
2/13/2002 2:26:31 PM yzhang requested saving the db, then deleting the strange database name
2/13/2002 4:02:16 PM cestep Sent another script to the customer, to try to delete the database.
3/18/2002 11:03:17 AM cestep Talked to the customer, and he agreed that the database was not hurting anything, and was not worth all this investigation. We can continue this if we come across it again, but I feel that this should be closed for now.
3/22/2002 4:38:51 PM yzhang This problem does not affect normal nethealth performance; the customer agrees that we can close this ticket
12/19/2001 8:52:50 AM cestep When the customer runs a discovery, it restarts the Nethealth server. Right before the server restarts, we see the following errors in the errlog.log file:
E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. The lock list currently holds 700 logical locks, and the maximum number of locks allowed is 700. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit.
The customer is able to reproduce this behavior at will, usually by deleting and discovering elements. They have done it with as little as 10 elements at once with the same results. The Sun machine is an E450 with dual processors and 2 GB of RAM. As a workaround, the customer has increased the log lock limit to 1000. All files are on BAFS, under ticket 56531.
1/2/2002 1:25:33 PM foconnor It appears that the workaround, increasing the log lock limit to 1000, is working o.k.
1/7/2002 9:00:12 PM rtrei marking this as a repeat of 19342. Yulun is currently testing the fix I put in. If it works, it will go out in patch 2.
12/20/2001 9:00:19 AM beta program PWC Global Don Mount donald.d.mount@us.pwcglobal.com 813-348-7252 Scheduled DBMaintenance failure (ORACLE) Log file: more DB_Maintenance.100015.log
----- Job started by User at `12/16/2001 10:30:01 AM`. -----
----- $NH_HOME/bin/nhDbMaint -u $NH_USER -d $NH_RDBMS_NAME -----
dbuRebuildIndexes: database error: -28650 ORA-28650: Primary index on an IOT can not be rebuilt
----- Scheduled Job ended at `12/16/2001 10:32:05 AM`. -----
-rw-r--r-- 1 neth dba 288 Dec 16 10:32 DB_Maintenance.100015.log
In addition, I received gaps in reports due to poller failures.
12/20/2001 9:13:34 AM beta program Additional info can be found in Public Folders/All Public Folders/Engineering/Beta Test/5.5/Beta Sites (Active)/PWC/Bugs (Embedded image moved to file: pic00335.pcx) (See attached file: syslog1217)
1/14/2002 5:38:49 PM rhawkes Reassigning to Gary Pratt, per our discussion.
1/15/2002 12:16:33 PM gmp not enough info to figure out what table is having the problem. Please have the customer run the following cmd and send us the output log:
./nhDbMaint -rebuildIndexes -Dm sa -Dfall >& /tmp/db.log
1/15/2002 2:25:07 PM gmp latest email from the customer: Gary, I will do this in the morning. Thanks, Don
1/16/2002 10:31:32 AM gmp the index rebuild is failing on the BSLN0 table, since it's an INDEX-ORGANIZED TABLE. The rebuild-index query will have to be updated to ignore these types of tables.
1/16/2002 12:27:57 PM gmp changed the query to only include indexes that have a 'NORMAL' index_type value.
2/26/2002 10:45:07 AM Betaprogram Customer verified this is FIXED IN BETA 4
1/3/2002 4:38:15 PM wburke Server goes down every night. nhServer start fails to connect to dbms due to lock quota exceeded.
00000a14 Thu Dec 27 11:10:39 2001 E_DM9266_UNLOCK_CLOSE_DB Error occurred unlocking a database during a close.
00000a14 Thu Dec 27 11:10:39 2001 E_DM9267_CLOSE_DB Error occurred closing a database.
CBA0CONC::[II\INGRES\778 , 00000a14]: Thu Dec 27 11:10:39 2001 E_DM0087_ERROR_CLOSING_DB Error closing database in server.
CBA0CONC::[II\INGRES\778 , 00000a14]: Thu Dec 27 11:10:39 2001 E_SC0122_DB_CLOSE Error closing database. Name: ehealth Owner: ehealth
CBA0CONC::[II\INGRES\778 , 00000a14]: Thu Dec 27 11:10:39 2001 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: f:\eHealth\oping\ingres\data\default\ehealth Flags: 00000003
CBA0CONC::[II\INGRES\778 , 00000a14]: Thu Dec 27 11:10:39 2001 E_DM0087_ERROR_CLOSING_DB Error closing database in server.
00000a14 Thu Dec 27 11:10:39 2001 E_CL1036_LK_RELEASE_BAD_PARAM LKrelease() failed due to a lock list id bad parameter; the input lock list id = 107; the number of lock lists in the system = 0.
00000a14 Thu Dec 27 11:10:39 2001 E_SX1008_BAD_LOCK_RELEASE Error releasing a lock list.
00000a14 Thu Dec 27 11:10:39 2001 E_SX1007_BAD_SCB_DESTROY Error destroying a SXF session control block.
CBA0CONC::[II\INGRES\778 , 00000a14]: Thu Dec 27 11:10:39 2001 E_SX000E_BAD_SESSION_END Error ending a SXF session.
CBA0CONC::[II\INGRES\778 , 00000a14]: Thu Dec 27 11:10:39 2001 E_DM003F_DB_OPEN Database is open.
00000a14 Thu Dec 27 11:10:39 2001 E_CL1036_LK_RELEASE_BAD_PARAM LKrelease() failed due to a lock list id bad parameter; the input lock list id = 13; the number of lock lists in the system = 0.
00000a14 Thu Dec 27 11:10:39 2001 E_CL1003_LK_BADPARAM Bad parameter(s) passed to routine.
1/9/2002 1:33:29 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, January 09, 2002 1:20 PM To: 'jirwin@centerbeam.com' Cc: Garcia, Richard Subject: Ticket # 58051
Jonas, We have finished our investigation of this issue. Computer Associates states that a lock-down on an active database, especially the transaction log, will cause I/O errors and non-recoverable transactions.
Although on reboot the system may work for a couple of hours, once the log reaches the point of the "lost/hung" transaction, it will no longer be able to read or write to this log file. Replacing the logfile and removing backups from any of the ingres directories will alleviate this problem in the future. _________________________
These errors were caused by a corrupted transaction log. The customer was running system backups while ingres was running, thus corrupting the logfile. - NoBug.
1/4/2002 8:02:25 AM beta program BT Adastral Park: Russell Webb russell.webb@bt.com 01473 607852 I now have beta 3 installed, but the installation did not create the database. In fact, the installation log for CNH gave the following error message: "Checking Saved writable files. Error: You must run query and setup mode before running doit. Cleaning up..." Russ Webb
1/9/2002 11:53:39 AM gdong Does the system have the NH_HOME and ORACLE_SID system variables set?
1/14/2002 12:04:31 PM beta program For 20115 - Does the system have the NH_HOME and ORACLE_SID system variables set? -----Original Message----- From: russell.webb@bt.com [mailto:russell.webb@bt.com] Sent: Monday, January 14, 2002 11:28 AM To: MFintonis@concord.com Subject: RE: 5.5 B2 Install Bug (19878) BT
Melissa, Well, you can close this one, but I've not had any feedback from bug 20115 yet, i.e. for beta 3 oracle database creation. Any news? We did get this error on beta 1, but I think the guys here just worked around it by rerunning the installation script for CNH. I would rather you could determine the fault; otherwise we'll keep getting it on each release. Thanks. Russ
1/14/2002 12:56:18 PM beta program Also, can you send us both the instehealth.log and nhCreateDb.log thanks! Melissa
1/14/2002 2:10:37 PM beta program Melissa, looking at this log file I can see that it looks as though I didn't opt for 'y' when asked if a database was required. I know I typed in 'y' at the time, but somehow it hasn't been recorded!! Russ.
<> Can you get the installation script checked at this point please, Melissa? I suspect something maybe a bit dodgy, perhaps with the answer that a user gives. Russ.
(Forwarded email from site to rlindberg, gdong, & rtrei. Stored email and log attachment in outlook public folder: Eng>Beta Test>5.5>Beta Sites (Active)>British Telecom>Bugs)
start installation at Thu Jan 3 17:26:08 GMT 2002
Before installing eHealth, you should make sure that:
1) The account from which you will run eHealth exists
You will need to supply the following information:
1) The directory eHealth will be installed in
2) The name of the user that will run eHealth
------------------------------------------------------------------------------
eHealth Location ---------------------------------------
Where should eHealth be installed? [/opt/neth]
There appears to be a version of eHealth in /opt/neth. You can:
1) stop now.
2) update the eHealth files in /opt/neth.
3) continue, without updating any eHealth files.
4) specify another directory.
What is your choice? (1|2|3|4) [2]
------------------------------------------------------------------------------
Online eHealth Guides ---------------------------------------
You can install online versions of the eHealth guides and make them accessible to users from the Web interface. (You must have an additional 35MB of disk space in the eHealth installation directory.)
Do you want to install the online versions of the eHealth Guides? [y]
The online guides will be installed in the /opt/neth/web/help/doc directory.
------------------------------------------------------------------------------
eHealth User ---------------------------------------
From which account will you run eHealth? [neth]
------------------------------------------------------------------------------
eHealth Date format ---------------------------------------
eHealth can display dates in one of the following formats.
1) mm/dd/yyyy
2) dd/mm/yyyy
3) yyyy/mm/dd
4) yyyy/dd/mm
What date format should eHealth use? (1|2|3|4) [2]
------------------------------------------------------------------------------
eHealth Time format ---------------------------------------
eHealth can display times in one of the following formats.
1) 12 Hour clock
2) 24 Hour clock
What time format should eHealth use? (1|2) [2]
------------------------------------------------------------------------------
Web Reporting Module ---------------------------------------
An HTTPD Web server will be installed.
Do you want this Web server to start automatically? [y]
What port should the Web server use? [80]
------------------------------------------------------------------------------
Oracle Database Table Setup ---------------------------------------
You will now be given the option of whether you want your Oracle database to be created and to have its initial load.
Do you want the creation of the oracle database to occur? (y|n)? [n]
-----------------------------------------------------------------------------
Distributed Console ---------------------------------------
Distributed consoles are used only in an eHealth clustered environment. Distributed consoles do not poll and cannot discover elements. For more information, refer to the eHealth Installation Guide.
Do you want to install this system as a distributed console? [n]
Please select whether you want to install using the small, medium, large, or XLarge model. This choice will determine the set of sizes used to create your tablespaces and tables.
Small <= 3,000 elements
Medium <= 10,000 elements
Large <= 25,000 elements
Extra Large > 25,000 elements
1) small
2) Medium
3) LARGE
4) XLARGE
Please enter the number of your selection :
-----------------------------------------------------------------------------
Database Directories ---------------------------------------
Oracle databases require the creation of a number of tablespaces distributed over several disks.
In order to create the database, the install program needs to know which directories to create these tablespaces in. eHealth supports between 1 and 9 directories for tablespaces. Each directory must be on a different device.
Enter number of directories to use for tablespaces :
Enter directory 1 :
---------------------------------------
No more questions
Take a break! The install will continue for a while (30 minutes or more). Additional time may be needed for database conversion.
*********************************************************************
* Interrupting this process will result in an unusable installation *
* or database. Please do not attempt to interrupt this without      *
* first contacting Concord Customer Service.                        *
*********************************************************************
---------------------------------------
Copy eHealth files
Moving writable files aside.
Copying the eHealth files to /opt/neth.
0% 25% 50% 75% 100% |||||||||||||||||||||||||||||||||||||||||||||||||||
Uncompressing files...
0% 25% 50% 75% 100% |||||||||||||||||||||||||||||||||||||||||||||||||||
Starting eHealth verification checks...
eHealth checksums verified successfully.
The eHealth files have been successfully copied.
Checking saved writable files.
Error: You must run query and setup mode before running doit.
Cleaning up...
1/14/2002 3:13:41 PM beta program Hi Russ, The default was No, so that means this was an upgrade. Why did you select to create the DB? Was the database not already there? thanks, Melissa
1/15/2002 7:54:14 AM beta program Melissa, Well, the database was there, but I thought the recommended line of action from beta 2 to beta 3 was to delete it and recreate it, so I deleted it. Unfortunately, I maybe didn't delete the database in the correct way, i.e. I removed all of oracle too, under the guidance of Bob Keville.
So I reinstalled Oracle for beta 3 and chose not to create the database (as it says in your Oracle instructions), and then ran the CNH install script where you say YES to create the database. But it didn't create it. Hope that wasn't too confusing!! Russ.
1/15/2002 10:25:38 AM gdong looks like it is a DB remove and recreation issue, reassign to Rob
1/16/2002 1:32:37 PM beta program I would definitely re-run the install at this point. It might make sense for me to call Russ directly to work through this. What might be a good time? Rob
1/16/2002 3:21:55 PM mfintonis How bout now, Rob. If I'm here I'm here. If I've gone then I'm not!.....? +44 (0) 1473 607852
1/16/2002 5:23:22 PM rlindberg I've spoken with Russ Webb and he is going to reset his machine to 4.8 to test migration and then re-test the install. I was concerned that he had a bad configuration, but we weren't able to test that. Marking ticket NoDupl. If Russ has problems, we'll re-submit.
1/23/2002 10:58:44 AM Betaprogram Melissa, I have some comments about the CNH 5.5 beta 3 oracle (8.1.7) installation, as follows:
Pre-installation: The only comment here is for item 11, setting your environment variables. Why does the path have to be set with /usr/ucb at the start? This means that when you carry out step 20 of the installation procedure, step 20 has to be done as root, because the version of ps (found under /usr/ucb) is the old Unix ps, i.e. it does not take -ef as params. Maybe here in step 11 it is meant to say:
unsetenv LD_LIBRARY_PATH
setenv LD_LIBRARY_PATH /usr/ucb:$PATH
Installation: Step 2.5 needed - Need to copy the .gz file from the Oracle CD to the hard disk. You need to unzip it using gunzip, say, and then untar it somewhere. This creates the 'solaris' directory. runInstaller is in the solaris directory.
Step 15: when you run root.sh, it tells you that ORACLE_SID is not set but ORACLE_HOME is.
Step 20 - as I mentioned earlier, you have to do this as root cos the path has /usr/ucb as first on the list. My initial beta test plan so far is as follows: clapton: To test the migration process from 5.02 to 5.5 b3 - system to become part of eHealth cluster. falconwood: To test the installation and running of 5.5 b3 from scratch - system to become part of eHealth cluster. New machine from Concord (not yet delivered here) - Install 5.5 b3 - system to become part of eHealth cluster - distributed console. Russ Webb

1/23/2002 12:00:09 PM wzingher Remarking NoDupl. Customer will return with info if still a problem.

1/7/2002 2:48:58 PM beta program Michael Loewenthal, BETA AE for Equant: Marcia Laing marcia.laing@equant.com 678-346-3772 When installing eHealth and the install script asks to create the DB and you choose yes, we got the error below. You have to cancel the install and then restart and choose not to create the DB to continue the install. When running the nhCreateDb command by hand, it looks as though the script runs correctly. -Submitted by Mike Loewenthal, the Beta AE- Oracle Database Table Setup --------------------------------------- You will now be given the option of whether you want your Oracle database to be created and to have its initial load. Do you want the creation of the oracle database to occur? (y|n)? [y] y ----------------------------------------------------------------------------- Distributed Console --------------------------------------- Distributed consoles are used only in an eHealth clustered environment. Distributed consoles do not poll and cannot discover elements. For more information, refer to the eHealth Installation Guide. Do you want to install this system as a distributed console? [n] read_dbf_sizes: Error: Can not locate dbf_size_data. checkSetupData: Error: Input file "/tmp/nhCreateDbQuery.16101.data" does not exist.
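The install above died because a temporary answer file the setup phase was supposed to write never appeared. A minimal sketch of the kind of guard a wrapper script can use for this pattern is shown below; the function name and file name are hypothetical stand-ins, not the actual eHealth installer code.

```shell
#!/bin/sh
# Sketch of a guard for the failure pattern above: a setup phase writes a
# per-PID answer file (a stand-in for /tmp/nhCreateDbQuery.<pid>.data),
# and a later phase consumes it. Names here are illustrative only.

require_file() {
    # Fail early, with a clear message, if a required data file is
    # missing or empty, instead of dying deep inside the create step.
    if [ ! -s "$1" ]; then
        echo "Error: $1 missing or empty; re-run the setup phase." >&2
        return 1
    fi
}

query_file="/tmp/createDbQuery.$$.data"
echo "create_db=yes" > "$query_file"   # setup phase records the answers
require_file "$query_file" && cat "$query_file"
rm -f "$query_file"
```

Checking the file before the create step turns the cryptic `checkSetupData` failure into an actionable message at the point where the cause (a skipped or failed setup phase) is still obvious.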
--------------------------------------------------------------------------

1/8/2002 1:53:54 PM beta program Changing Severity from High to CRITICAL per Donna Amaral (all Equant bugs should be marked CRITICAL).

1/8/2002 10:04:15 PM rlindberg This should now be fixed and Mike has the workaround. Marking fixed for B4. I believe Equant took a kit that had install issues.

3/11/2002 2:23:54 PM Betaprogram Email from Mike Loewenthal, BETA AE: Bug 20153: At StateFarm, I was able to choose CreateDB during the install and the install continued. The DB creation failed (for a different reason), but I remember this bug wouldn't even allow me to continue with the install. I believe this bug was for Solaris and if I should test this on Solaris (or NT if it's the NT version), I'll need to test it at a virgin site.

1/9/2002 5:09:20 PM dsionne -----Original Message----- From: Trei, Robin Sent: Tuesday, January 08, 2002 9:29 AM To: Sionne, Domenic Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 CA ticket # 11607726/1 (not in our Remedy tickets)

-----Original Message----- From: Sionne, Domenic Sent: Tuesday, January 08, 2002 9:23 AM To: Trei, Robin Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 Hi Robin, Thanks. What is the Prob ticket #? I need to associate it with the call ticket. best regards, Domenic

-----Original Message----- From: Trei, Robin Sent: Monday, January 07, 2002 8:13 PM To: Sionne, Domenic Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 OK, I've created a ticket. It will take 2-3 days for the discussions to get interesting, so please ping me on Wednesday morning.
-----Original Message----- From: Sionne, Domenic Sent: Monday, January 07, 2002 6:59 PM To: Trei, Robin Subject: FW: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8

-----Original Message----- From: Trei, Robin Sent: Monday, January 07, 2002 10:20 AM To: Sionne, Domenic Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 END of day. That will be around 9 tonight. Can you put a timer function on your email?

-----Original Message----- From: Sionne, Domenic Sent: Monday, January 07, 2002 10:19 AM To: Trei, Robin Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 $ ping robin Pinging robin [ticket 57318] with 32 bytes of data: :^ )

-----Original Message----- From: Trei, Robin Sent: Friday, January 04, 2002 5:15 PM To: Kaufman, Joel; Sionne, Domenic Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 ok, I forgot. Talked with Dominic. He is going to ping me to be sure I've talked with them by end of day on Monday.

-----Original Message----- From: Kaufman, Joel Sent: Friday, January 04, 2002 4:55 PM To: Trei, Robin; Sionne, Domenic Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 I don't have any additional info Robin. The only one who would drive a change from CA would be you. Last time I mentioned this to you, you said you would ask CA if they had a work around for this issue. Joel

Joel D.
Kaufman Engineering Product Manager Concord Communications, Inc. P. 508.303.4237 F. 508.303.4344 email. jkaufman@concord.com

-----Original Message----- From: Trei, Robin Sent: Friday, January 04, 2002 4:33 PM To: Sionne, Domenic Cc: Kaufman, Joel Subject: RE: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 I was aware that there was a problem, but not that a ticket had been logged with CA. In 4.8 we upgraded to the new version of Ingres II (because the old one had become so obsolete it was not supported) and went to their new licensing mechanism. That requires read/write privs in /usr/local. As far as I knew this was a hard fact. If someone from Concord is driving this issue to get CA to make a change, I am unaware of it. Joel, do you know more?

-----Original Message----- From: Sionne, Domenic Sent: Friday, January 04, 2002 3:33 PM To: Trei, Robin Subject: ticket # 57318: Getting errors while attempting to upgrade from Network Health 4.7.1 to eHealth 4.8 Hi Robin, Our customer Morgan Stanley cannot upgrade from 4.7 to 4.8. In this case, the issue appears to be caused by their implementation of Kerberos on all their systems. In a nutshell, they are performing the upgrade, not as native root, but a root-like account defined in Kerberos. Also the permissions on /usr/local are read only. Our response to them was that permissions on /usr/local must be read/write and that we do not support any root but the native OS root.
They can perform a fresh install of 4.7.1, but not 4.8. Apparently, something has changed with Ingres as well and this is why I am emailing you. The AE, John Nadasi, says that he and Rob Jarivis were working with engineering (Jeff Martin and Dave Andrews), and that there was to be a problem ticket opened with CA and a bug opened up on our end. Today, John mentioned that I should speak to you. Are you aware of this? Have we a problem ticket assigned to this issue? best regards, Domenic

================================== Customer sensitivity: Our customer Morgan Stanley cannot upgrade from 4.7 to 4.8. In this case, the issue appears to be caused by their implementation of Kerberos on all their systems. In a nutshell, they are performing the upgrade, not as native root, but a root-like account defined in Kerberos. Also the permissions on /usr/local are read only. Our response to them was that permissions on /usr/local must be read/write and that we do not support any root but the native OS root. They can perform a fresh install of 4.7.1, but not 4.8.

1/10/2002 10:55:11 AM dbrooks No bug per esc ticket meeting 1/10.

1/11/2002 10:55:17 AM beta program Alcatel USA, Inc Reinhard Pfaffinger 972 519 4943 rpfaffin@usa.alcatel.com Jeff Beck (NPI) onsite testing 1/9 - 1/10. Scheduler: nhSchedule randomly hangs after a 5.02 > 5.5 migration. Also will hang the console on occasion.

1/15/2002 12:07:59 PM shonaryar Asked Jeff to send back some tar files which are needed for migration.
saeed

1/16/2002 9:04:37 AM beta program Hi Jeff, we need more info; please let me know if you can get Saeed these files. thanks Melissa -----Original Message----- From: Honaryar, Saeed Sent: Wednesday, January 16, 2002 8:13 AM To: Fintonis, Melissa Subject: RE: Beta bug 20296 I didn't get the files yet. saeed

1/16/2002 9:41:24 AM beta program -----Original Message----- From: Fintonis, Melissa Sent: Wednesday, January 16, 2002 8:20 AM To: Beck, Jeff Cc: Betaprogram Subject: FW: Beta bug 20296 Hi Jeff, we need more info; please let me know if you can get Saeed these files. thanks Melissa

-----Original Message----- From: Beck, Jeff Sent: Wednesday, January 16, 2002 9:08 AM To: Fintonis, Melissa Cc: Betaprogram Subject: RE: Beta bug 20296 Saeed asked me for these files yesterday afternoon. Unfortunately Reinhard removed everything from the system and re-installed clean after I left. So no files for you. jeff

-----Original Message----- From: Fintonis, Melissa Sent: Wednesday, January 16, 2002 9:31 AM To: Beck, Jeff; Honaryar, Saeed Cc: Betaprogram Subject: RE: Beta bug 20296 Bummer! So where do we go with this ticket from here?

1/16/2002 9:57:34 AM beta program -----Original Message----- From: Tisdale, Lorene Sent: Wednesday, January 16, 2002 9:35 AM To: Fintonis, Melissa; Beck, Jeff; Honaryar, Saeed Cc: Betaprogram Subject: RE: Beta bug 20296 Ask Reinhard if this is still a problem. If not, close the ticket.

-----Original Message----- From: Beck, Jeff Sent: Wednesday, January 16, 2002 9:38 AM To: Tisdale, Lorene; Fintonis, Melissa; Honaryar, Saeed Cc: Betaprogram Subject: RE: Beta bug 20296 He didn't run a migration this time though, so his environment now is not the same as what we had when it was a problem.

1/16/2002 10:10:35 AM beta program So now what?

1/17/2002 9:22:17 AM shonaryar I am closing this bug since I have no way to reproduce it. saeed

1/18/2002 11:05:31 AM beta program Status should be NoDupl.

1/18/2002 3:00:45 PM beta program .
11/11/2002 11:01:47 AM beta program Alcatel USA, Inc Reinhard Pfaffinger 972 519 4943 rpfaffin@usa.alcatel.com Jeff Beck (NPI) onsite testing 1/9 - 1/10 The usage syntax for nhCreateDb is wrong: it gives you one syntax, then when you use that syntax it gives you a different syntax. Also, the migration document gives no syntax.

1/14/2002 5:03:47 PM rlindberg Change usage to this: Usage: nhCreateDb [-h] The program is used to create the database. It will question the user for all information required. The -h option gives you this help message. Or this for UNIX: Usage: nhCreateDb [-h] The program is used to create the database. It will question the user for all information required. The -h option gives you this help message. It must be run as 'root'.

1/11/2002 1:16:09 PM beta program Unisys David Lerchenfeld 734-737-7202 david.lerchenfeld@unisys.com When attempting to disable a scheduled job, got an Oracle error msg: nhSchedule -disable 100003 Error: Database error: (ORA-01552: cannot use system rollback segment for non-system tablespac).

1/15/2002 11:35:49 AM rhawkes -----Original Message----- From: Tikku, Sanjay Sent: Friday, January 11, 2002 3:22 PM To: Trei, Robin; Venuto, Donna; Lindberg, Rob; oracle_port Cc: Hawkes, Richard Subject: RE: ProbT0000020302 has been submitted. From Sanjay: This tells me that the SYSTEM rollback segment was ONLINE. It should be offlined after the database has been created and new rollback segments have been brought online. This is a standard thing to do when creating Oracle databases.

1/17/2002 1:56:10 PM rhawkes Unisys believes that their DBA did some work to cause this to happen.

2/26/2002 11:06:45 AM Betaprogram Customer Verified

1/15/2002 5:42:48 PM schapman Equant (France) had a power failure that caused the database to go inconsistent. In trying to save the database after forcing it, they got the following error: ehnce01% nhSaveDb -p savedb.tdb nethealth See log file /opt/eh/log/save.log for details... Begin processing 01/11/2002 15:09:39.
Copying relevant files (01/11/2002 15:09:40). Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats1_1004486399 () INTO '/opt/eh/savedb.tdb/nh_stats1_1004486399'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Fri Jan 11 10:09:57 2002) ). (cdb/DuTable::saveTable)

We tried to drop the table from the database using the verifydb command: verifydb -mrun -sdbname $NH_RDBMS_NAME -odrop_table nh_stats1_1004486399 This was not successful: ehnce01% verifydb -mrun -sdbname $NH_RDBMS_NAME -odrop_table nh_stats1_1004486399 S_DU04C4_DROPPING_TABLE VERIFYDB: beginning the drop of table nh_stats1_1004486399 from database nethealth. Aborting because of error E_QE007C Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) (Mon Jan 14 05:54:03 2002)

I talked to Yulun and he wanted me to log a bug, escalate it, and ask the customer to attempt to drop the offending table from the rollup boundary and reattempt the save. We will have the feedback from the customer on 1/16 AM due to the time difference. Related files are on BAFS.

1/15/2002 5:48:46 PM yzhang Request support have the customer remove the entry for that stats1 table from nh_rlp_boundary before the db save.

1/17/2002 12:04:46 PM yzhang Don't use verifydb drop table. Please remove the entry from the nh_rlp_boundary table by running SQL like: delete from nh_rlp_boundary where rlp_type = 'ST' and rlp_stage_nmbr = 0 and max_range = the time stamp for this stats1 table. Can you make this one the highest priority? This is escalated and the customer is down.
Thanks Yulun

1/17/2002 5:29:25 PM yzhang Can you have the customer try the following before our call: the problem table is nh_stats1_1004486399, right? Stop nhServer and Ingres, then start Ingres and nhServer. nhForceDb nethealth > force.out echo "help table nh_stats1_1004486399\g" | sql nethealth > stats1.out Send the errlog.log. Collect nhCollectCustData (see if they can do this).

1/21/2002 12:10:10 PM yzhang I have requested Sandrine do the database save, then send us the save.log

1/22/2002 1:34:47 PM jnormandin - Re-requested info

1/23/2002 11:48:25 AM yzhang The db save succeeded; make sure to keep this saved database in a safe place. Follow the steps for destroying the db: 1) remove the nethealth physical files (as ingres) 2) recycle Ingres (as ingres) 3) destroydb nethealth (as nh_user). Josan, can you write detailed steps for her, and cc me. Sandrine, don't do this until you get the detailed steps from Josan. Thanks Yulun

1/23/2002 12:20:44 PM yzhang Use this, I modified it from yours: login as ingres; cd to $II_SYSTEM/ingres/data/default/nethealth; rm *; stop the Ingres db: ingstop (if this fails use ingstop -force); restart the Ingres db: ingstart; login as nhuser; destroydb nethealth. Note they have no iidbdb, and need to use destroydb, not nhDestroyDb.

1/25/2002 9:37:07 AM yzhang I think we had Sandrine remove the nethealth directory under .../data/default, so it should look empty. Do the following first: 1) login as ingres 2) ingstop -force 3) ingstart 4) login as nhuser 5) destroydb nethealth. After this, send the errlog.log under /opt/eh/ingdb/ingres/data/files, and ingprenv > ingprenv.out (Jason asked for these two). What is the problem with the nh_element table?

2/4/2002 11:30:11 AM yzhang Issue solved.

1/16/2002 1:14:45 PM beta program Unisys Mail Stop MSE7-119 Unisys Way Blue Bell, PA 19424 David Lerchenfeld 734-737-7202 david.lerchenfeld@unisys.com Ingres DB devkmen

1/16/2002 1:19:32 PM beta program Lee, Could you please call Dave Lerchenfeld from Unisys at 734-737-7202.
He's in the process of "trying" to run through the Migration Testing. He's stuck somewhere with Ingres DB DEVKMEN? He's an HP11 site. Ticket #20406 Thanks! Lorene Tisdale Program Management Beta Team T: 508-486-4518 F: 508-481-9772

1/16/2002 1:20:27 PM beta program Lorene, Saeed should be calling the customers for Migration, not Lee. Lorene, please open a problem ticket for Saeed to track. Thanks, Donna

1/17/2002 8:26:35 AM mfintonis Email string: Folks, Unisys is back up and on line.... SUMMARY: -------------- HISTORY: This site installed 5.5 B3 then removed it to re-install 4.8. They cleaned the system by removing nh.install.cfg and the eHealth directory. ISSUE: Ingres is giving the user DB DEVKMEN error messages during the installation of 4.8. SOLUTION: ----------------- Stopped the Trapexploder. Cleaned the system, removing the following files: /sbin/init.d/trapexploder /sbin/init.d/httpd.sh /sbin/init.d/nethealth.sh /sbin/rc*.d/ (removed all the links to the trap, httpd.sh, and nethealth.sh files) /etc/nh.install.cfg (verified it was removed) /ca_lic (removed the entire directory) Removed $NH_HOME. Rebooted the system and re-installed 4.8. - Regards, Lee M. LoPilato Principal Software QA Engineer Concord Communications, Inc. 600 Nickerson Road Marlboro, MA 01752 P: (508) 303-4275 F: (508) 303-4344

-----Original Message----- From: Amaral, Donna Sent: Wednesday, January 16, 2002 3:37 PM To: Honaryar, Saeed; Lopilato, Lee Cc: Tisdale, Lorene; Venuto, Donna; Hawkes, Richard; Lindberg, Rob; Trei, Robin Subject: RE: HELP: 5.5 Stand Alone Site (Unisys) Saeed, Yes. Lee can try to troubleshoot. Please assign the problem ticket to Lee. If he is unable to solve it, he can escalate. Donna

-----Original Message----- From: Trei, Robin Sent: Wednesday, January 16, 2002 2:31 PM To: Honaryar, Saeed; Tisdale, Lorene; Venuto, Donna; Hawkes, Richard; Lindberg, Rob; Amaral, Donna Subject: RE: HELP: 5.5 Stand Alone Site (Unisys) I will call if necessary.
However, can we get Lee Lopilato (preferred) or Yulun to take a stab first? I thought Lee was handling some of the initial beta problem contacts. Donna A-- is this a decision you can work?

-----Original Message----- From: Honaryar, Saeed Sent: Wednesday, January 16, 2002 1:34 PM To: Tisdale, Lorene; Venuto, Donna; Hawkes, Richard; Lindberg, Rob; Trei, Robin Subject: RE: HELP: 5.5 Stand Alone Site (Unisys) I just talked to Dave Lerchenfeld; the problem is not migration. He installed NetHealth 4.8; he is trying to load the database from a backup DB and he is running into a problem. I will let Robin know to contact him. saeed

-----Original Message----- From: Tisdale, Lorene Sent: Wednesday, January 16, 2002 1:10 PM To: Honaryar, Saeed Subject: HELP: 5.5 Stand Alone Site (Unisys) Saeed, Could you please see the below detail and call this 5.5 Beta site ASAP. Ticket #20406 Thanks!

-----Original Message----- From: Amaral, Donna Sent: Wednesday, January 16, 2002 1:08 PM To: Tisdale, Lorene; Lopilato, Lee Cc: BetaGroup Subject: RE: HELP: 5.5 Stand Alone Site (Unisys) Lorene, Saeed should be calling the customers for Migration, not Lee. Lorene, please open a problem ticket for Saeed to track. Thanks, Donna

-----Original Message----- From: Tisdale, Lorene Sent: Wednesday, January 16, 2002 1:06 PM To: Lopilato, Lee Cc: BetaGroup Subject: HELP: 5.5 Stand Alone Site (Unisys) Lee, Could you please call Dave Lerchenfeld from Unisys at 734-737-7202. He's in the process of "trying" to run through the Migration Testing. He's stuck somewhere with Ingres DB DEVKMEN? He's an HP11 site. Ticket #20406 Thanks! Lorene Tisdale Program Management Beta Team T: 508-486-4518 F: 508-481-9772

1/17/2002 8:38:18 AM mfintonis Yes ... this can be closed.
The customer did not properly clean the HP system prior to the installation of 4.8; this was in preparation to perform migration testing to 5.5. -Closing as a NoBug -Melissa

2/26/2002 11:10:26 AM Betaprogram Customer Verified; this appears to be an operational problem.

1/16/2002 3:36:14 PM beta program SEAGATE: Migration document does not indicate if eHealth should be shut down. Please advise.

1/16/2002 3:47:30 PM shonaryar I contacted Joseph Madi and answered all his questions. It is better to shut down all eHealth services before migration; otherwise the new installation will kill them all by default. saeed

1/23/2002 2:18:21 PM Betaprogram This is NoBug. They just needed some advice. saeed

1/16/2002 3:44:45 PM beta program CSC: -Submitted by CCRD Support person onsite: Bob Keville- CCRD: CSC: Bob Keville Victor Wiebe rkeville@concord.com vwiebe@csc.com 508-303-4385 (302) 391-8862 If you run the command nhDestroyDb with the argument ehealth, the old syntax, the $NH_HOME dir is removed. This is a problem on NT and 2000 as this leaves the registry entries behind. This forces you to remove the NuTCracker entries manually with regedit. This can easily become a problem for customers who are used to our old nhDestroyDb syntax.

1/18/2002 4:47:22 PM rlindberg We pulled all the code to remove NH_HOME out of nhDestroyDb; it was inappropriate. The command now takes no arguments. If an argument is supplied, it is ignored. It removes the DB associated with ORACLE_SID/NH_RDBMS_NAME.

1/16/2002 5:07:35 PM beta program SEAGATE: Lee will fill in.

1/16/2002 5:38:30 PM llopilato BETA ACCOUNT: SEAGATE SUPPORTING AE: Joseph Madi BETA TESTING: MIGRATION Joseph installed and ran 5.0.1 without issue. His testing required him to install 5.5 B3 in prep for the migration test suite. It was during the installation of 5.5 B3 that he allowed the installation script to kill all of the 5.0.1 processes running.
During the actual installation, at the 21% install mark (while creating eHealth services for 5.5), a 'Service function Create eHealth failed' error message popped up and the installation was terminated. WORK AROUND: Verified that all of the 5.0.1 services (except Ingres) were terminated/stopped, then restarted the 5.5 B3 install. QUESTION: Is this documented? Should we allow the 5.5 install to kill 5.0.1 processes? What is the right approach here? SQA CONTACT: Lee M. LoPilato @ EXT. 4275

1/17/2002 8:26:58 AM rbonneau Lee got the AE past the license issue; however, Lee is noting that the install appears to be confusing which processes it needs to kill (between 5.0 and 5.5). Please see Lee for further details - reassigning to Steve.

1/17/2002 9:16:33 AM smcafee Not sure what the problem really is here. Can you investigate, Tracy?

1/17/2002 7:27:40 PM tfang Ran a 5.0.1 -> 5.5 (wanda mainline) migration install on my test machine (NT). It does not happen. Will try it on Win2k after fixing the nutc problem, probably next Monday or Tuesday.

1/22/2002 5:53:13 PM tfang Could not reproduce the problem on Win2K (VKumar-opt borrowed from Xia) either (5.0.1 -> 55b3 and 55 01/22). Left a message for Joseph to see if we can get more info about the customer's system.

1/23/2002 6:32:01 PM tfang Talked with Steve; tentatively to close as NoDupl.

1/24/2002 5:33:04 PM tfang Close as NoDupl.

1/18/2002 8:44:19 AM beta program UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734) 737-7202 When performing the Ingres to Oracle migration steps, the step that reconfigures the web server using nhiHttpdCfg causes a memory fault core dump; output follows. nhiHttpdCfg -user $NH_USER -grp users -nhDir $NH_HOME -cfg users.cfg > /nethealth5.5/web/httpd/httpd.conf Memory fault(coredump)

1/18/2002 11:12:23 AM fmali This is fixed in beta 4 as Remedy 20174. The current problem comes from beta 3, which does not have the fix. See 20174 for more information.
-Thanks Florent -

2/26/2002 11:11:22 AM Betaprogram Customer Verified; this is Fixed in BETA 4.

1/21/2002 8:42:41 AM cestep The conversation rollup fails with the following error messages (seen in the NT event log): The description for Event ID ( 12028 ) in Source ( NuTCRACKER 4 ) could not be found. It contains the following insertion string(s): 2, MEM_OUT_OF_MEMORY, Application, [nhiDialogRollup.exe (.\malloc.c:1300) PID=499 TID=404]. The description for Event ID ( 1004 ) in Source ( Network Health ) could not be found. It contains the following insertion string(s): nhiDialogRollup.exe: Internal Error: Unexpected Null value for 'C++ constructor' (possibly out of memory). (nwb/newFailed). Even if we choose high rollup values of 10 weeks for as-polled, hourly, daily and weekly samples (which is long before the 12/06/2001 where the nodes have been introduced to the database), the rollup fails with the same message. The rollup will run for about 10 minutes and then fail, using 350 MB of memory. The total memory usage is 800 to 900 MB; at least 100 MB RAM and a lot of swap space (3 GB) was free.

1/25/2002 9:18:57 AM schapman -----Original Message----- From: ICS Product Support [mailto:support@ics.de] Sent: Thursday, January 24, 2002 11:36 AM To: ts_mgrs@concord.com Subject: [Fwd: Ticket #58821: ICSREQ003764: Memory error during rollup] Hello, please escalate this call to engineering immediately! We had major problems with the TA product due to 2 million nodes in the database. After several weeks we managed to fix this problem, but still the conversation rollup is not working and there is a danger that this will cause new problems. If any further problems occur with TA, the customer will certainly throw out this product.
Martin Hinsberger-Heintz Support Manager ICS

1/25/2002 11:28:58 AM yzhang Colin, the rollup.out is too big for me to open; can you send me the second half of the file. This is the new problem we saw with the conversation rollup; the reason is they have too many nodes. I need the following: 1) output of nhDbStatus 2) output of nhCollectCustData Thanks Yulun

1/25/2002 11:45:31 AM cestep Did a tail on the rollup.out file and saved it to tail.out. Also requested the additional information from the customer.

1/28/2002 11:08:04 AM yzhang Still waiting for DbStatus and nhCollectCustData.

1/28/2002 12:19:26 PM cestep Received the files from the customer. On BAFS, under 58821.

1/28/2002 12:19:35 PM cestep Moving to assigned.

1/29/2002 1:41:09 PM cestep Any updates on this one? ICS keeps asking me for an update on the status of this.

1/30/2002 8:28:48 AM cestep Received more information from the reseller: Throughout the troubleshooting for this problem, Markus has run a series of nhDbStatuses. He has seen the DB size shrink, and he noticed that the date on the last rollup continues to update. He began running reports with the customer and saw that there was decreasing granularity over time - meaning to him that the rollups are indeed occurring, even though the error message still exists.

1/30/2002 5:01:43 PM yzhang Colin, this is for problem 20502; they have an Ingres stack dump problem. Please send the following attached zip file (which includes nhiReport and nhiStdReport) to the customer, have them back up the original nhiReport and nhiStdReport from the bin/dyd directory, then use the attached ones, and do whatever they are supposed to do.

1/31/2002 3:22:51 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Thursday, January 31, 2002 11:08 AM To: Zhang, Yulun Cc: Estep, Colin Subject: Problem Ticket 20502 Yulun, Yesterday, you sent Colin a zip file containing nhiReport and nhiStdReport.
The customer is hesitant to use these new files since this is the same machine that had experienced the issue with problem ticket 18326. Do the files you provided yesterday address both issues, problem ticket 20502 AND 18326? Please let us know - thank you, Mark

2/1/2002 12:34:31 PM yzhang Created an issue with CA.

2/1/2002 2:31:16 PM yzhang I created an issue with CA; they need the following: 1) config.dat 2) Ingres patch level (get a file from the customer called version.rel and patch.doc from the ingres directory)

2/5/2002 10:59:12 AM yzhang I need the config.dat and Ingres patch level from the customer as soon as possible.

2/5/2002 11:34:02 AM yzhang Linda, here is the additional information you requested. Let me know as soon as possible if you have any solution. Thanks Yulun

2/6/2002 12:14:20 PM yzhang Jason, here is the procedure you need to work through with the customer. Basically they need to turn the group buffers to 0 and double ii.netsys.dbms.*.stack_size from 131072 (current) to 262144. Do a practice run on your machine before instructing the customer: login as ingres; source net*.csh; run cbf; select DBMS Server, F1 Config; leave the highlight where it is; F1 Cache; highlight 2k, F1, type Con to set dmf_group_size 0; F1, End; highlight DMF Cache 4k, F1, type Con to set dmf_group_size 0; F1, End; highlight DMF Cache 8k, F1, type Con to set dmf_group_size 0; stop and restart Ingres again. Don't do anything with 16k, 32k, and 64k.

2/7/2002 10:38:54 AM yzhang Can you check with the customer to see if changing the Ingres configuration works. Yulun

2/7/2002 10:41:00 AM jnormandin "Hey Jason, we modified the cache and increased the stack_size. Nothing helped. What now ...? regards," I just received an update. The changes did not resolve the issue.
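The procedure above makes two Ingres DBMS parameter changes: DMF group buffers off, per-thread stack size doubled. The supported way to do this is the cbf utility; the mock below only illustrates the intended before/after on an invented config fragment, and the key names and file layout are assumptions, not the real config.dat format.

```shell
#!/bin/sh
# Illustration only: show the two intended parameter edits (group buffers
# to 0, stack_size 131072 -> 262144) on a mock config fragment. The real
# Ingres config.dat must only be changed through cbf; this fragment's
# layout is invented for the demo.
cfg="/tmp/mock_config.$$.dat"
cat > "$cfg" <<'EOF'
ii.netsys.dbms.private.cache.dmf_group_size: 8
ii.netsys.dbms.*.stack_size: 131072
EOF

# Turn group buffers off (0) and double the stack size.
sed -e 's/\(dmf_group_size:\).*/\1 0/' \
    -e 's/\(stack_size:\).*/\1 262144/' "$cfg"

rm -f "$cfg"
```

Doubling the stack size targets the stack-dump crash, while zeroing the group buffers removes the large group-read allocations that were suspected of contributing to the memory pressure during the rollup.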
2/7/2002 10:50:30 AM yzhang Send the new config.dat and errlog.log, and find out exactly what problem they have after doing the configuration change.

2/12/2002 8:47:47 AM jnormandin Requested information saved to: \\Bafs\Escalated Tickets\58000\58821\2.12.02\ICSREQ003764_120202

2/12/2002 11:52:01 AM yzhang Jason, here is what we need to do for this customer: 1) You work with the customer to make sure all the group_size settings for 2k to 16K are reset to 0. 2) Have the customer run ~yzhang/scripts/cleanNodes_pwc.sh just by typing the script name; after it finishes, send concord.out, located under the directory where they ran the script. Thanks Yulun

2/19/2002 10:02:38 AM jnormandin The customer set the values in the config.dat and ran the cleanNodes_pwc.sh script. This didn't change the behavior of the rollup. All relevant information found on 58821\2.19.02 *note: dmf_group_size is set properly to 0 as requested.

2/21/2002 9:59:47 AM yzhang Do you know how long it takes to see the out-of-memory error after starting the conversation rollup?

2/21/2002 11:50:34 AM yzhang This is the problem with the out-of-memory error from the conversation rollup. Can you tell me what swap space you have. Also can you provide me with all available information regarding your memory and disk space. As far as I know you currently only have 512 MB of swap space; this is apparently not enough for doing the conversation rollup. Is it possible you can add another 1 GB of swap space. Thanks Yulun

2/21/2002 4:08:45 PM mwickham Requested memory information from customer.

2/22/2002 1:16:57 PM jnormandin From: ICS Product Support [mailto:support@ics.de] Sent: Friday, February 22, 2002 11:35 AM To: Support Subject: Re: ICSREQ003764 Call 58821: Conversation Rollup Support, here again (was also sent to Yulun) > > Support > > Currently we are awaiting the following information requested > yesterday by our Database developer Yulun Zhang: > > 1- Are we correct in our records that the customer has > 512 Meg swap configured ?
3 GB > 2- If so, would it be possible for the customer to add > another Gig to that amount > 3- Please forward the output of the nhDbStatus command attached > 4- How long after the rollups begin does the out of memory > error occur ? 5-10 minutes Stephan

2/22/2002 1:59:28 PM yzhang Requested the customer db so that we can reproduce the problem in house.

2/25/2002 11:09:51 AM dbrooks See above.

3/1/2002 11:19:28 AM rkeville De-escalate until db is obtained.

3/12/2002 3:56:37 PM cestep Received the database. I loaded 4.8 P6 on Windows NT. Loaded the database save. First, ran nhiRollupDb - no errors. Then, nhiDialogRollup - no errors. From the machine: C:\nethealth\bin\sys>nhiDialogRollup Begin processing (3/12/2002 03:08:50 PM). End processing (3/12/2002 03:21:55 PM). Checked the tables in the database; that verifies the rollup worked. Files on BAFS under 58821/in-house

3/27/2002 9:38:06 AM yzhang Check the following on your in-house system, then compare to the customer's system: swap space, all memory information, and disk space information. Check the conversation rollup schedule, such as how long you keep as-polled data, 4-hour data, and 1-day data; if the customer keeps data longer, see if they can set it back to default.
Thanks Yulun
3/28/2002 10:24:25 AM cestep From the output of nhDbStatus:
Database Name: nethealth
Database Size: 2372648960.00 bytes
RDBMS Version: II 2.0/9808 (int.wnt/00)
Location Name        Free Space             Path
+-------------------+----------------------+---------------------------------+
| ii_database       | 3553580000.00 bytes  | E:\nethealth\oping              |
+-------------------+----------------------+---------------------------------+
Statistics Data:
  Number of Elements: 1302
  Database Size: 959258624.00 bytes
  Location(s): ii_database
  Latest Entry: 28/1/2002 09:46:59
  Earliest Entry: 30/3/2000 01:00:00
  Last Roll up: 26/1/2002 20:01:44
Conversations Data:
  Number of Probes: 23
  Number of Nodes: 36338
  As Polled:
    Database Size: 35422208.00 bytes
    Location(s): ii_database
    Latest Entry: 28/1/2002 09:30:00
    Earliest Entry: 22/1/2002 11:34:45
    Last Roll up: 27/1/2002 04:05:33
  Rolled up Conversations:
    Database Size: 191315968.00 bytes
    Location(s): ii_database
    Latest Entry: 21/1/2002 10:45:00
    Earliest Entry: 9/12/2001 00:21:55
  Rolled up Top Conversations:
    Database Size: 181157888.00 bytes
    Location(s): ii_database
    Latest Entry: 21/1/2002 10:45:00
    Earliest Entry: 11/11/2001 00:21:04
-----------------------------------------------------------------------------
Machine info: - Windows NT 4.0 SP6 - 1024 MB of RAM - over 3 GB of swap space
I will find out what the schedule is for the rollups.
4/2/2002 5:12:41 PM yzhang The nhDbStatus output is from running nhDbStatus on your system, not the customer's, right? You really need to compare the rollup schedule (I mean how long you keep the stats0 and stats1 data).
4/12/2002 8:17:52 AM cestep -----Original Message----- From: ICS Product Support [mailto:support@ics.de] Sent: Wednesday, April 10, 2002 10:40 AM To: Support Subject: Re: Ticket #58821 - Error during dialog rollup / ICSREQ003764 Colin, they keep the as polled conversations for 5 days. 
Best Regards, Siegi
4/19/2002 6:27:07 PM yzhang I still think the problem is due to the difference between the customer's system and your in house system. Can you run the following script on both your in house machine and the customer machine, then get the output? Check with Jason if you have questions regarding running the script.
5/2/2002 6:42:22 PM yzhang Any luck getting the following information? I still think the problem is due to the difference between the customer's system and your in house system. Can you run the following script on both your in house machine and the customer machine, then get the output? Check with Jason if you have questions regarding running the script.
5/6/2002 7:46:41 AM cestep Requested information again.
5/9/2002 10:27:19 AM cestep Received the output of nhInfo, but the reseller had modified the script and it was missing some system information (RAM, swap). Got a newer version of the script from Jason and sent this to the customer. Waiting for the results.
5/21/2002 10:43:36 AM mwickham Customer provided the output of the script...located on BAFS in \escalated tickets\58000\58821\21May02\systemSpecs.log
5/21/2002 12:03:05 PM yzhang I looked at the new information regarding the customer's NT. It looks to me that they have good disk space, memory, and swap space. Colin, I think you have the customer db in house, and you did not reproduce the customer's problem. Do you still have the db loaded on an NT in house? I want to take a look. The other thing you need to do is to compare the system messages from the customer to the machine where Colin did the test. See if there is any big difference that might cause the problem. 
Thanks Yulun
6/13/2002 7:54:54 AM foconnor Node Address Pairs 96888, Nodes 95245. The first conversation rollup is the one that fails; the subsequent rollups run successfully.
6/13/2002 7:57:46 AM foconnor NETHEALTH VERSION Network Health version: 4.8.0 D0 - Patch Level: 8 RAM Physical Memory (RAM): 1048MB SWAP SPACE Total Paging File Space: 3100MB
8/5/2002 10:34:55 AM jkuefler See also 22900 (PWC). I will be tracking the changes to this problem there and will leave this ticket open until the patch or one-off is safely delivered to this particular customer. It will be several weeks before this can be resolved in-house. No estimate for how long it will take to get in the field.
8/14/2002 11:59:51 AM pkuehne See call ticket; the customer is no longer seeing the problems. They dropped TA data and moved the machine F
1/21/2002 9:13:57 AM nalarid The customer has received the following error specifically on Solaris machines, both after moving from Patch 06 to Patch 07 on Version 4.8, and also after an upgrade to 5.0.1 from 4.8 P07: /appl/eh/db/ckpsave.tdb/ingres/dmp/default/nethealth filename:aaaaaaaa.cnf is created with wrong permissions. No data was written to the database; errlog.log shows the following errors:
open() failed with operating system error 13 (Permission denied) SNWHT01 ::[50725, 000000d0]: Mon Nov 19 08:23:14 2001 E_DM9004_BAD_FILE_OPEN Disk file open error on database:nethealth table:Not a table pathname:/appl/eh/db/ckpsave.tdb/ingres/dmp/default/nethealth filename:aaaaaaaa.cnf
open() failed with operating system error 13 (Permission denied) SNWHT01 ::[50725, 000000d0]: Mon Nov 19 08:23:14 2001 E_DM923B_CONFIG_CLOSE_ERROR Error occurred closing the configuration file (aaaaaaaa.cnf).
SNWHT01 ::[50725, 000000d2]: Mon Nov 19 08:27:43 2001 E_CL061B_DI_ACCESS User does not have adequate file permissions to open a file. The file may be marked read-only while a read-write open request was made. 
open() failed with operating system error 13 (Permission denied) SNWHT01 ::[50725, 000000d2]: Mon Nov 19 08:27:43 2001 E_DM9004_BAD_FILE_OPEN Disk file open error on database:nethealth table:Not a table pathname:/appl/eh/db/ckpsave.tdb/ingres/dmp/default/nethealth filename:aaaaaaaa.cnf
open() failed with operating system error 13 (Permission denied) SNWHT01 ::[50725, 000000d2]: Mon Nov 19 08:27:43 2001 E_DM923B_CONFIG_CLOSE_ERROR Error occurred closing the configuration file (aaaaaaaa.cnf).
SNWHT01 ::[50725, 000000db]: Mon Nov 19 08:29:17 2001 E_CL061B_DI_ACCESS User does not have adequate file permissions to open a file. The file may be marked read-only while a read-write open request was made.
open() failed with operating system error 13 (Permission denied) SNWHT01 ::[50725, 000000db]: Mon Nov 19 08:29:17 2001 E_DM9004_BAD_FILE_OPEN Disk file open error on database:nethealth table:Not a table pathname:/appl/eh/db/ckpsave.tdb/ingres/dmp/default/nethealth filename:aaaaaaaa.cnf
Before the upgrade, the permissions on file aaaaaaaa.cnf were root:root rw------- and after the upgrade they were root:root rwx------. The database backup on these systems is consistently done without checkpoints, and has not been restored from a checkpoint save at any time. Despite the fact that checkpoint saves are not being performed, these machines do have checkpoint locations designated. The customer attempted to do an interactive save via the console and received the same error as listed above. He was then able to work around the situation by changing the permissions on the aaaaaaaa.cnf file to 777. 
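The 777 workaround above opens the file to everyone. A narrower mode is usually enough; the sketch below demonstrates the idea on a scratch file (the real target would be the aaaaaaaa.cnf path from the log, and the exact mode and ownership your Ingres install expects may differ):

```shell
# Demonstration on a scratch file; substitute the real .cnf path in production.
CNF=$(mktemp)                # stand-in for .../dmp/default/nethealth/aaaaaaaa.cnf
chmod 600 "$CNF"             # simulate the restrictive owner-only mode from the ticket
chmod 644 "$CNF"             # owner rw, group/other read: far narrower than 777
ls -l "$CNF" | cut -c1-10    # -rw-r--r--
rm -f "$CNF"
```

If the root cause is ownership (the file was created by root but must be opened by the ingres/nethealth account), a chown to that account is the cleaner fix than widening the mode.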
1/25/2002 3:05:19 PM yzhang send this: echo "select * from iifile_info where file_name = 'aaaaaaaa'\g" | sql $NH_RDBMS_NAME >table_name.out Also, can you do a convert test to see if you can reproduce the problem?
1/30/2002 7:54:16 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Wednesday, January 30, 2002 7:44 AM To: Zhang, Yulun Subject: Pt 20505 After upgrade, file aaaaaaaa.cnf has wrong Yulun, Attached is the output of the sql command. Thanks, Mike
2/26/2002 4:20:59 PM ebetsold Problem no longer occurs after upgrade to 5.02 or installation of P1.
1/21/2002 11:01:55 AM beta program ALCATEL USA: Reinhard Pfaffinger Rein.Pfaffinger@alcatel.com (972) 519-4943 The nhiDbStatus processes consume at or close to 100% CPU.
1/22/2002 10:32:34 AM wzingher If this is really using up 100% CPU in any use case, let's make that a beta4 fix. Please investigate to determine the severity of the problem, Saeed.
1/23/2002 11:45:43 AM rhawkes The customer thinks he did a couple of things wrong and is retrying the process.
1/28/2002 10:41:40 AM Betaprogram Hi Reinhard, Last week you said you thought you did a few things wrong and were going to re-try the process. How did you make out? Please let us know as soon as you can. Thanks! Melissa
1/29/2002 8:30:27 AM Betaprogram Melissa, I messed up the beta 3 reinstall after the migration. My Oracle SID and eHealth DB names were mixed up. I am reloading Oracle and beta 3. Thanks.
1/29/2002 10:18:11 AM Betaprogram Email from site: Melissa, yes, please close the ticket as user error. Thanks.
1/21/2002 2:21:49 PM rrick Problem: nhSaveDb fails with DMT_SHOW error. Begin processing (1/21/2002 08:59:13 AM). Copying relevant files (1/21/2002 08:59:14 AM). Unloading the data into the files, in directory: 'D:/db_save.tdb/'. . . Unloading table nh_active_alarm_history . . . Unloading table nh_active_exc_history . . . Unloading table nh_alarm_history . . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . 
Unloading table nh_alarm_subject_history . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_calendar . . . Unloading table nh_calendar_range . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table nh_exc_subject_history . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_exc_history . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_le_global_pref . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . .
Fatal Internal Error: Unable to execute 'COPY TABLE nh_job_step () INTO 'D:/db_save.tdb/njs_b47'' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Mon Jan 21 08:59:47 2002) ). (cdb/DuTable::saveTable)
Things that have been done: 1. Rebuilt the nh_job_step index, nh_job_schedule index, and nh_run_step index. 2. Took the system down and back up. 3. Recreated the save directory as a fresh directory. The customer does not have a good save.
1/25/2002 3:19:17 PM yzhang They have one stats0 table, plus the nh_run_step and nh_job_step tables; they all have the DMT_SHOW problem, and most likely their db has been corrupted. 
Do the following:
stop nhServer
stop ingres (use ingstop; if it fails, use ingstop -force)
ingstart
login as nhuser and source the environment
verifydb -mreport ...... send the iivdb.log
sysmod $NH_RDBMS_NAME > sysmod.out
infodb $NH_RDBMS_NAME > info.out
2/5/2002 10:13:06 AM cestep Since the logs were provided by Russ Rick on 1/29, changing this back to assigned.
2/26/2002 2:31:00 PM yzhang Customer has DMT_SHOW error on nh_run_step. We need the following information: 1) what version of nethealth they are running; 2) find out if this table and the corresponding physical file exist. Let me know if you don't know how to do this. Thanks Yulun
2/26/2002 2:54:51 PM cestep Requested information from customer.
2/26/2002 3:07:08 PM cestep From the customer:
select file_name from iifile where table_name = 'nh_run_step';\g continue * Executing . . .
+--------+
|file_nam|
+--------+
|aaaaaeoj|
+--------+
(1 row)
-------------------------------------------------------------------------------------------------------------------------------
$II_SYSTEM/ingres/data/default/nethealth: -rwxrwxrwx 1 Administrators None 1024000 May 30 2001 aaaaaeoj.t00
--------------------------------------------------------------------------------------------------------------------------------
Network Health version: 4.8.0 D0 - Patch Level: 0
2/26/2002 3:07:20 PM cestep Moving back to assigned.
2/28/2002 8:02:49 AM cestep Received the results of "select * from nh_run_step" from the customer. 
File is on BAFS, under 58030/2.26.02
2/28/2002 8:04:23 AM cestep Actually, it's the result of "help table nh_run_step;\g"
3/4/2002 2:01:38 PM yzhang The output of nh_run_step has a syntax error; I guess you did not look at the output file from the customer. Now you need to check with the customer to make sure they have this table in the database; then you can follow a script located at ~yzhang/scripts/dmt_show.sh (on system sulfur). Note that the script is for a different table, so first you need to understand what the script will do, then work with the customer on their table that shows the DMT_SHOW error. Let me know if you have questions.
4/1/2002 5:31:00 PM yzhang What is the current status on this one? Are you in the process of getting the information?
4/2/2002 11:04:36 AM dsionne Yulun, I apologize for not notifying you. The customer ended up reformatting the disk and re-installing eHealth. The customer asked to close the ticket.
4/2/2002 5:17:16 PM yzhang closed
1/21/2002 2:39:32 PM beta program UNISYS: Dave started migrating a 1.5 GB database containing 1236 elements about 48 hours ago. It is still running. Wants to speak with someone to determine if he should stop it or let it continue. Dave Lerchenfeld's phone number is (734) 737-7202.
1/22/2002 1:47:04 PM rhawkes Xia retested this database and had no problems -- the migration completed in 2 hours. Dave thinks he may have applied an Ingres patch after loading the DB rather than before. He's going to try the whole process again.
1/23/2002 11:45:17 AM rhawkes Waiting for Unisys info
1/25/2002 11:42:15 AM shonaryar The migration was running so slowly because of a corrupt Ingres database. After running sysmod on the Ingres database, the migration finished in time. 
saeed
1/25/2002 1:19:00 PM shonaryar just marking closed saeed
2/26/2002 11:11:54 AM Betaprogram Customer verifies this is now working
1/23/2002 10:24:02 AM Betaprogram PWC: PWC Global Don Mount donald.d.mount@us.pwcglobal.com 813-348-7252 - Separated out from Ticket # 19971 (multiple bugs) - Conversation Poller failures creating gaps in reporting. Also received multiple -dlg emails during poll failures. Sample syslog entries:
Friday, December 14, 2001 04:52:14 PM Error (nhiMsgServer) Pgm nhiMsgServer: `Traffic Accountant` poller is not running (last poller activity at `12/14/2001 15:52:15`).
Friday, December 14, 2001 04:55:07 PM (nhiPoller[Dlg]) Pgm nhiPoller[Dlg]: A scheduled poll was missed, the next poll will occur now (Conversations Poller).
Friday, December 14, 2001 05:51:30 PM (nhiPoller[Dlg]) Pgm nhiPoller[Dlg]: `More than one sample of data has been written! `.
Friday, December 14, 2001 05:55:06 PM Error (nhiMsgServer) Pgm nhiMsgServer: `Traffic Accountant` poller is not running (last poller activity at `12/14/2001 16:55:07`).
Friday, December 14, 2001 06:00:41 PM (nhiDbServer)
Thursday, December 20, 2001 9:42:30 AM beta program Additional info can be found in Public Folders/All Public Folders/Engineering/Beta Test/5.5/Beta Sites (Active)/PWC/Bugs (Embedded image moved to file: pic00335.pcx) (See attached file: syslog1217)
Thursday, December 20, 2001 12:06:13 PM beta program Email to site: Hi Don, In order to find out why some polls are missed we'd like to know what the CPU usage is on the system. Can you set up a CPU monitor and send a report that corresponds timewise with the missed polls? Thanks! Lorene
Thursday, December 20, 2001 12:27:36 PM beta program Log forwarded to Eric Karten.
Thursday, December 20, 2001 1:13:46 PM ekarten Where's the log? Also, please ask the customer if the data gaps correspond to the missed polls. Reports may be helpful in determining this. 
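To check whether the data gaps line up with the missed polls, the missed-poll timestamps can be pulled straight out of a syslog extract and compared against the report gaps. A minimal sketch (the sample line is taken from the ticket's syslog entries; the real log path will differ):

```shell
# Extract and count missed conversation polls from a syslog extract.
cat > /tmp/syslog.sample <<'EOF'
Friday, December 14, 2001 04:52:14 PM Error (nhiMsgServer) Pgm nhiMsgServer: `Traffic Accountant` poller is not running (last poller activity at `12/14/2001 15:52:15`).
Friday, December 14, 2001 04:55:07 PM (nhiPoller[Dlg]) Pgm nhiPoller[Dlg]: A scheduled poll was missed, the next poll will occur now (Conversations Poller).
EOF
# Timestamps of missed polls, for matching against report gaps:
grep 'scheduled poll was missed' /tmp/syslog.sample | awk '{print $2, $3, $4, $5, $6}'
# How many polls were missed in this extract:
grep -c 'scheduled poll was missed' /tmp/syslog.sample   # 1
rm -f /tmp/syslog.sample
```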
Thursday, December 20, 2001 2:54:17 PM beta program Email sent to site per Eric Karten: Hi Don, It appears that the program did not do anything and then it ran out of memory. Please shut down non-Traffic Accountant applications on the system and try all the steps again. So far we really have nothing to work with here toward solving the problem. Thanks! Lorene
Thursday, January 03, 2002 9:32:42 AM mfintonis -----Original Message----- From: Karten, Eric Sent: Thursday, January 03, 2002 9:18 AM To: 'donald.d.mount@us.pwcglobal.com' Cc: Venuto, Donna; Betaprogram Subject: Still Awaiting Information Donald, I am still awaiting the database and other information in order to work on beta bugs 19799 and 19971 for eHealth 5.5.
Friday, January 04, 2002 9:43:39 AM beta program Email from beta site: It looks like I finally transferred the file successfully this morning. 150 Binary data connection for pwcta1221.tar.gz (155.201.35.52,32764). 226 Transfer complete. local: pwcta1221.tar.gz remote: pwcta1221.tar.gz 441630263 bytes sent in 2.3e+03 seconds (187.76 Kbytes/s)
Friday, January 04, 2002 11:41:51 AM ekarten The dB was for the other bug. Sorry for the confusion. What I need here is information on probe timeouts and how long polls take, to try to find out why the poller seems to be stopping. In order to gather this information, here are some instructions to follow. We call this running the poller in print perf mode.
- Disable the poller in the startup.cfg file.
- cd $NH_HOME/sys
- vi startup.cfg
- find the nhiPoller entry with the -dlg switch and change disable to yes
- save changes and exit
- Stop and restart the nethealth servers. nhServer stop nhServer start
- Start the poller manually, from $NH_HOME/bin/sys.
- nhiPoller -dlg -printPerf 2 > $NH_HOME/tmp/printPerf.wri 2>&1
- Let it poll for three or four polls.
- Interrupt it with Control C.
- Reset the startup.cfg.
- Stop and restart the nethealth servers. 
nhServer stop nhServer start
- Send me the $NH_HOME/tmp/printPerf.wri
Monday, January 14, 2002 4:00:31 PM beta program Re-requested more info from site.
Tuesday, January 15, 2002 9:32:59 AM beta program Melissa, I currently am running with no errors since I had to start with a fresh database. As the database grows I suspect that rollup failures and poll misses will start happening. Thanks, Don
Tuesday, January 15, 2002 9:33:51 AM beta program I understand that the problems occur after some time. However, if you wait until the problems begin to perform the test, I won't get the results I need to address this issue. Please start this test before the problems begin to appear. If you need to delete the printPerf.wri file because of disk space, as long as the errors have not begun to occur, please do so. Thanks. Eric
Tuesday, January 15, 2002 10:53:22 AM beta program Donald, Did you notice if a core file was created when the roll-ups failed? If you still have it, would you please send it to the FTP site? Also, what is the value of 'ulimit' (just type ulimit)? If it is not unlimited, try setting it to unlimited. Let me know if the failure is delayed or no longer occurs. I'd still like you to do the print perf test, preferably at the same time in case the error still eventually occurs. Thanks for your help. Eric
Wednesday, January 16, 2002 1:22:50 PM beta program Attached is the requested file output. (See attached file: printPerf.wri) Thanks, Don (Attachment stored in original email from beta site in outlook public folder: Engineering>Beta Test>5.5>Beta Sites (active)>PWC Global>Issues/Bugs. Also emailed to Eric Karten)
Wednesday, January 16, 2002 3:17:29 PM mfintonis I think we can tweak the performance of the probes here. It may help. You've got one probe that looks like it is being timed out for taking more than 20 minutes to respond (InetPrimaryIn-probe-3). The rest are taking no more than five minutes or so. 
I can't tell from the data if the probe has a configuration or hardware problem or if it is on a slow connection but you'll need to look into this. If you can't fix the probe, try disabling it in the poller configuration window. Also, several probes (including the slow one) are using the old SNMP v1 method of collecting data (Get Next) instead of the newer SNMP v2 method (Get Bulk). We support this newer method in eHealth 4.8 and later. You should use it for all probes if possible. You might even be able to get firmware upgrades for some of the probes that don't currently support Get Bulk. There is a new feature added since 5.0 which may also help a little. You can set the environment variable NH_DLG_TIME2KEEP to the number of hours that a node will remain in memory without seeing any more activity. The default is 24 hours. If you reduce this to 12 hours the rate of growth of the poller node cache will slow. You can adjust this value to your liking but there is a point of diminishing returns if you set this value too low, nodes will simply keep reappearing after being deleted. Be sure to cycle the servers (stop and start) after changing this value so it can take effect. There is also another variable NH_POLL_DLG_BPM which has a default value of 500 bytes per minute as the minimum data rate for a conversation to be recorded. If you increase this value to 750 or 1000 you will also see a slowing of the growth of the node cache. Again, cycle the pollers to take effect. I have not been able to load the dB because it is so large. We have equipment on order so that I may be able to do this. One last question? How many nodes are you seeing in the cache (database/status options)? Do you have some idea of how long it takes to get to that value? Eric Thursday, January 17, 2002 8:08:22 AM mfintonis Eric, Not sure why the probes were not set to bulk get. I was getting a timeout message on the InetPrimaryIn-probe-3 about the 20 minute timeout. 
This probe has lots of conversations since it is on the Internet. I will set the agent type to RMON2 (conversations) via SNMPv2c GetBulk. I guess my question would be why these didn't discover with the bulk get agent type. Your explanation of the environment variables is very helpful. Shouldn't the NH_DLG_TIME2KEEP variable be in the nethealthrc.sh file so it can be easily modified? How does the old variable NH_UNREF_NODE_LIMIT=3 fit in? Currently my database status shows 417,288 nodes. On the old system I was given a command to see how many conversation pairs were in the database, which sometimes grew to millions. How can I get that info from this system? I recently found an issue with Name Nodes in which it failed a few times. I found that Name Nodes used quite a bit of space on my system root tmp file system /tmp. Shouldn't it use $NH_HOME/tmp instead? Name Nodes required about 2 GB of disk space on a fairly new database (2 days old). I'm afraid to run Name Nodes a second time. Thanks, don
Thursday, January 17, 2002 12:26:00 PM mfintonis Meeting scheduled to discuss: Subject: Updated: PWC 5.5 Bug 19971 Location: CR6-Cancun Start: Thu 1/17/02 12:00 PM End: Thu 1/17/02 12:30 PM Recurrence: (none) Meeting Status: Accepted Required Attendees: Karten, Eric; McAfee, Stephen; Zingher, Wendy; Venuto, Donna; Karten, Eric; Pratt, Gary; Kaufman, Joel; Amaral, Donna; Trei, Robin; D'silva, Anil; Pattabhi, Ravi; Wolf, Jay Optional Attendees: Tisdale, Lorene Rescheduling this for lunch time on the same day at the request of Steve. We are looking for a short term solution (such as tuning Oracle) to the problem reported by PWC in problem ticket 19971. Please note that PWC has another critical problem ticket 19307 against 4.7 that shows similar symptoms.
Thursday, January 17, 2002 1:51:45 PM mfintonis "Karten, Eric" on 01/17/2002 11:36:46 AM To: Donald D Mount/US/GTS/PwC cc: Subject: RE: 19971 - NEED MORE INFO Don, It turns out we only support SNMP v2 (get bulk) for Netscout probes. 
If yours are Netscout then they may not be configured correctly. Internally, the discover module sends SNMP v1 requests for all probes. If a probe responds with a Netscout OID then discover will send out a second request using v2. If there is no response to the second request, eHealth records the probe as only capable of get next (v1). Otherwise, the probe element is recorded as supporting get bulk (v2). Simply changing the agent type in the poller configuration window won't have any effect. The probe timeout is a feature new for 5.0. The default timeout is 20 minutes. If a probe hits the timeout before completing its response to a poll, the poller will save the data it has so far received. Unsent data may be lost. You can modify this value with a new environment variable, NH_POLL_PROBE_TIME_LIMIT. Please refer to the eHealth Administration Reference, pg. 53 for important usage notes. Traffic Accountant was not designed to handle probes monitoring the internet. This may be a big part of the performance degradation you are seeing. I suggest that you disable this probe. However, if this traffic is important to you then you might try decreasing the timeout value to 10 to 15 minutes. This would increase the amount of internet data you are losing, but it would also reduce the growth rate of the node cache as well as increase the likelihood that the poller will complete before the poller itself reaches the normal end of a cycle. You can put any environment variable in the nethealthrc.csh.usr file. Most of the time when we introduce a new variable we do NOT put it in the nethealth.csh file. The new variable NH_DLG_TIME2KEEP replaces NH_UNREF_NODE_LIMIT. I don't know about the command for determining the number of node address pairs. You may be able to use it with 5.5 or it may need revision. Can you send me the command? Was it a script? There is one other small thing I thought of that you could do to improve performance. 
If you are not doing any stats, import or traps polling you can turn these pollers completely off (otherwise they remain idle but still consume memory). In $NH_HOME/sys there is a file called startup.cfg. This file lists four pollers under the section labeled "program nhiPoller". Do the following for only the three pollers that do not have the row "arguments -dlg" under the program section. In the row labeled "disable", in the next column on that row place the string "yes #" in front of the existing text. Save the file and cycle the servers. Eric Thursday, January 17, 2002 1:52:00 PM mfintonis -----Original Message----- From: donald.d.mount@us.pwcglobal.com [mailto:donald.d.mount@us.pwcglobal.com] Sent: Thursday, January 17, 2002 12:37 PM To: EKarten@concord.com Cc: jwitte@concord.com Subject: RE: 19971 - NEED MORE INFO Eric, All my probes are Netscout probes. It is my understanding that the goal at some point is for Traffic Accountant to support Internet probes. After setting the probe manually with the bulk get agent type this probe has stopped the timeouts. The conversation node pair information was acquired with sql queries given to me by Yulun Zhang or Walter Burke while troubleshooting rollup failures with 4.8 & Ingres. Thanks, Don Thursday, January 17, 2002 1:52:19 PM mfintonis Don, I'm glad you've got the probes tuned. I would like to see another print perf if it's not too much trouble. The one you sent was missing some summary data usually sent to stderr. The summary data will give me clu< es into the performance of the poller. The command has to be run from a CSH the way I described it. Also, the 2>&1 at the end is needed to send stderr to the file. nhiPoller -dlg -printPerf 2 > $NH_HOME/tmp/printPerf.wri 2>&1 I appreciate all the effort you have put into helping me to help you. Eric Friday, January 18, 2002 11:26:37 AM smcafee Don, First here are my notes on your current configuration that we discussed. - Beta 3 installed roughly Jan 4. 
Database has been recreated since this was installed. - SysEdge and system health installed - Roughly 8 Netscout probes and 1000 statistics elements. - 1 Netscout probe is on the internet - All Netscout probes are now using Get Bulk. Previously 4 to 5 of them were not set to use this automatically via discover. - Roughly 400,000 nodes currently - 3 OC-3 probes will come on line this week (node and pair counts expected to rise) Also, I noted that this configuration in 4.8 would only work for about a month before scale problems would cause the system to fail and you would restart from scratch. Also, on 4.8, conversations rollups would interfere with polling and cause data to be lost. This doesn't seem to be happening in 5.5. Here is the set of things we'd like you to collect at your earliest convenience. It looks lengthy, but it's just a series of edits and commands interspersed with periods of waiting.
1. Copy your current $NH_HOME/sys/debugLog.cfg to debugLog.orig. Save the attached debugLog.cfg to overwrite your current file. This file controls the debug tracing settings that programs will use when advanced logging is enabled. There is an "arguments" line for each program. I've set up what we want to collect for this test.
2. Delete or move everything out of $NH_HOME/log/advanced. You may already be running advanced logging there for Eric or someone else. If not, just delete the files in this directory.
3. CD to $ORACLE_HOME/admin/udump/$ORACLE_SID and delete all the *.trc files. You'll probably need to be logged into the oracle user account to delete these *.trc files.
4. Make note of the time! From the console use "Setup->Advanced Logging..." to enable tracing for "Conversations Poller", "Console" and "Database Rollup".
5. After a couple of conversations polls you can go back into "Setup->Advanced Logging..." and disable both Console and Conversations Poller tracing.
6. Make note of the time! 
Use the Motif Console to try to run the protocol reports you are having trouble with so that we can capture tracing on this.
7. Make note of the time! From the command line run nhiNameNodes -Dall > $NH_HOME/log/advanced/nhiNameNodes_dbg.txt
8. Wait for Conversations Rollup to run if it hasn't since you enabled Advanced Logging.
9. Go back into "Setup->Advanced Logging..." and disable "Conversations Rollup" advanced logging.
10. Log in as oracle. Extract the attached script runtkall to a file in the $ORACLE_HOME/admin/udump/$ORACLE_SID directory. CD to this directory, use "chmod +x" to make it executable, and run it. Make a zip file that contains everything now in this directory.
11. Get the current node count from the Database Status dialog and get the node pair count via the following commands: $ sql $NH_RDBMS_NAME select count(*) from nh_node_addr_pair;\g \q
12. Make a zip file of the $NH_HOME/log directory including subdirectories.
13. Run a System At-A-Glance report for the eHealth system for the current day and the last week.
14. Send us: - the zip file from $ORACLE_HOME/admin/udump/$ORACLE_SID - the zip file from $NH_HOME/log - the PDF and ascii versions of the two system reports - the node count and node pair count - the times for the steps where noted above - any core files found after the test in $NH_HOME/bin or $NH_HOME/bin/sys or $NH_HOME
Again, thanks for your help in making this a quality product. Let me know if you have any questions. Steve McAfee Software Development Manager Concord Communications, Inc (508) 303-4234
Tuesday, January 22, 2002 8:13:26 AM mfintonis FYI- I will start on your list today. 
Thanks, Don --------
Tuesday, January 22, 2002 9:09:33 AM beta program -----Original Message----- From: donald.d.mount@us.pwcglobal.com [mailto:donald.d.mount@us.pwcglobal.com] Sent: Tuesday, January 22, 2002 8:54 AM To: MFintonis@concord.com Subject: RE: TA discovery of NetScout probes beta 3 Melissa, I'm working on the list of things you want done. It appears that number 7, nhiNameNodes, failed to generate any output to the log file nhiNameNodes_dbg.txt. This is what appeared on the screen where I ran the command you gave me:
Z,du ] rows: 296636 [Z,du ] sqlca.sqlcode: 0 [Z,du ] rows: 296637 [Z,du ] sqlca.sqlcode: 0 [Z,du ] rows: 296638 [Z,du ] sqlca.sqlcode: 0 [Z,du ] rows: 296639 [Z,du ] sqlca.sqlcode: 0 [Z,du ] rows: 296640 [Z,du ] sqlca.sqlcode: 0 [Z,du ] rows: 296641 [z,cu ] returning env var = 'NH_MSG_TXT_FILE' for type = 51 [z,cu ] returning env var val = '' for type = 51 [d,cu ] returning dflt str val = 'messageText.sys' for type = 51 [z,cu ] returning env var = 'NH_SYS_DIR' for type = 3 [z,cu ] returning env var val = '' for type = 3 [z,cu ] returning env var = 'NH_HOME' for type = 1 [z,cu ] returning env var val = '/opt/concord/neth' for type = 1 [d,cu ] returning file = '/opt/concord/neth/sys/messageText.sys' for type = 51 [i,cu ] Opening file = '/opt/concord/neth/sys/messageText.sys', mode = 0x5, prot = 0774 [z,cu ] Returning protection: 0444 for file: '/opt/concord/neth/sys/messageText.sys' [i,cu ] Opened file: '/opt/concord/neth/sys/messageText.sys' [i,cu ] Closing file = '/opt/concord/neth/sys/messageText.sys' [i,cu ] Close complete, status = Yes
Internal Error: Unexpected Null value for 'C++ constructor' (possibly out of memory). (sv/newFailed) [d,cba ] Exit requested with status = 1 [d,cba ] Exiting ...
Internal Error: Expectation for '_txnLevel == 0' failed (~DuDatabase - Unmatched transaction level in file ../DuDatabase.C, line 130). (cu/cuAssert) [d,du ] Disconnecting from db: NHTD, user: neth, handle: [0xfda30] ... [d,du ] Disconnected. 
uxpwcapp4% pwd /opt/concord/neth/log/advanced uxpwcapp4% ls -al total 387744 drwxr-xr-x 2 neth dba 4096 Jan 22 08:05 . drwxr-xr-x 7 neth dba 8192 Jan 22 08:45 .. -rw-r--r-- 1 neth dba 128121 Jan 22 07:24 nhiConsole.txt -rw-r--r-- 1 neth dba 24214125 Jan 22 08:10 nhiDialogRollup_30009.txt -rw-r--r-- 1 neth dba 0 Jan 22 07:52 nhiNameNodes_dbg.txt -rw-r--r-- 1 root dba 174146618 Jan 22 07:23 nhiPoller_Dlg.txt Thanks, Don Tuesday, January 22, 2002 9:23:12 AM beta program Melissa, I failed to copy the debug.cfg over properly and got down to step 9. Is any of the debug data good with the old debug.cfg file? If not I will start over tomorrow morning. I have several other projects I need to work on today. I will be on vacation from Thursday to next Wednesday, so I hope to have the data completed by tomorrow. Thanks, Don Attachments stored in original email in outlook public folder: Engineering> Beta Test> 5.5> Beta Sites (active) > PWC Global . Bugs/Issues Tuesday, January 22, 2002 12:20:03 PM beta program There are several issues here. Let me focus on the Name Nodes problem he ran into. It appears Name Nodes ran out of memory trying to load >500K nodes. This is a known issue that has been scheduled for 5.6. Eric ---------------------------------------------------------------------------------------------------------------------- There are three different issues being handled under this one bug. It really should be three bugs. 1) Conversations poller fails. 2) Conversations rollups fails. 3) Traffic Accountant Name Nodes utility fails. I have been looking into the first issue. The database group is dealing with the second issue. The third issue is a known problem targeted for 5.6. Since this bug is assigned to me I will keep it and continue working on the poller aspect. Please open two other bugs for items 2 and 3. Assign the second bug to the DB group. Assign the third bug to me. I will promptly close this for the reason stated. 
THIS IS THE SECOND ISSUE FOR THE DB ROLLUP BUG 1/23/2002 1:39:05 PM Betaprogram Since the bug was split this only pertains to that part (20576). Eric -----Original Message----- From: Fintonis, Melissa Sent: Wednesday, January 23, 2002 11:38 AM To: Karten, Eric Cc: McAfee, Stephen Subject: FW: BUG #(19971) PWC - MORE INFO -----Original Message----- From: donald.d.mount@us.pwcglobal.com [mailto:donald.d.mount@us.pwcglobal.com] Sent: Wednesday, January 23, 2002 11:16 AM To: MFintonis@concord.com Cc: Betaprogram Subject: Re: BUG #(19971) PWC - MORE INFO The files have been ftp'd to your incoming directory at ftp.concord.com: pwcdbg.tar.gz nhtd.tar.gz Here is the set of things we'd like you to collect at your earliest convenience. It looks lengthy, but it's just a series of edits and commands interspersed with periods of waiting. 1. Copy your current $NH_HOME/sys/debugLog.cfg to debugLog.orig. Save the attached debugLog.cfg to overwrite your current file. This file controls the debug tracing settings that programs will use when advanced logging is enabled. There is an "arguments" line for each program. I've set up what we want to collect for this test. 2. Delete or move everything out of $NH_HOME/log/advanced. You may already be running advanced logging there for Eric or someone else. If not, just delete the files in this directory. 3. CD to $ORACLE_HOME/admin/udump/$ORACLE_SID and delete all the *.trc files. You'll probably need to be logged into the oracle user account to delete these *.trc files. 4. Make note of the time! From the console use "Setup->Advanced Logging..." to enable tracing for "Conversations Poller", "Console" and "Database Rollup". (5:55AM) 5. After a couple of conversations polls you can go back into "Setup->Advanced Logging..." and disable both Console tracing and Conversations poller. 6. Make note of the time! Use the Motif Console to try to run the protocol reports you are having trouble with so that we can capture tracing on this. 
(7:29AM) Report Failed 7. Make note of the time! From the command line run nhiNameNodes -Dall > $NH_HOME/log/advanced/nhiNameNodes_dbg.txt (7:52AM) Name Nodes exceeded /tmp disk space and failed 8. Wait for Conversations Rollup to run if it hasn't since you enabled Advanced Logging. 9. Go back into "Setup->Advanced Logging..." and disable "Conversations Rollup" advanced logging. 10. Log in as oracle. Extract the attached script runtkall to a file in the $ORACLE_HOME/admin/udump/$ORACLE_SID directory. CD to this directory, use "chmod +x" to make it executable and run it. Make a zip file that contains everything now in this directory. 11. Get the current node count from the Database Status dialog and get the node pair count via the following commands: $ sql $NH_RDBMS_NAME select count(*) from nh_node_addr_pair;\g \q This command provided no output. Please provide a new procedure (sqlplus?) and login info. 12. Make a zip file of the $NH_HOME/log directory including subdirectories. ( pwcdbg.tar.gz) 13. Run a System At-A-Glance report for the eHealth system for the current day and the last week. 14. Send us: - the zip file from $ORACLE_HOME/admin/udump/$ORACLE_SID ( nhtd.tar.gz) - the zip file from $NH_HOME/log (pwcdbg.tar.gz) - the PDF and ascii versions of the two system reports - the node count and node pair count (539,252 nodes from DBStatus) - the times for the steps where it was mentioned above. - any core files found after the test in $NH_HOME/bin or (No core files) $NH_HOME/bin/sys or $NH_HOME (See attached file: uxpwcapp4sys.txt)(See attached file: uxpwcapp4sys.pdf) (See attached file: pwcweek.pdf)(See attached file: pwcweek.txt) ---------------------------------------------------------------- The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. 
Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer. 1/31/2002 3:52:05 PM rtrei I ran the new nhiDialogRollups against the PWC database and it went fine. I am not positive that this is the actual PWC database that was experiencing the problems, although Eric K thought it was. I am closing this ticket, with the assumption that the changes put into the B4 nhiDialogRollup will help here. If there are still problems, we will need to reopen, but there is nothing more that can be done until the customer starts running with the new nhiDialogRollup. 2/1/2002 1:13:28 PM Betaprogram I spoke with Donna Venuto about this one. She agreed the ticket status should be MoreInfo in order for the customer to confirm that nhiDialogRollup in Beta 4 fixes the problem. -Donna Amaral 2/4/2002 8:45:18 AM Betaprogram Hi Don, Please confirm that nhiDialogRollup in Beta 4 fixes this problem. thanks! Melissa 2/7/2002 1:29:31 PM mfintonis Melissa, I need a new sql procedure for the node pairs. Here is the data you requested. 4. 10:27AM Appeared that Conversation polling of probes failed to collect while advanced logging was turned on. 5. 11:20AM 6. 11:22AM Reports appeared to run properly 7. 11:24AM Created output file but no data in file. 
Suspect ran out of space on root /tmp [d,cu ] returning file = '/opt/concord/neth/sys/messageText.sys' for type = 51 [i,cu ] Opening file = '/opt/concord/neth/sys/messageText.sys', mode = 0x5, prot = 0774 [z,cu ] Returning protection: 0444 for file: '/opt/concord/neth/sys/messageText.sys' [i,cu ] Opened file: '/opt/concord/neth/sys/messageText.sys' [i,cu ] Closing file = '/opt/concord/neth/sys/messageText.sys' [i,cu ] Close complete, status = Yes Internal Error: Unexpected Null value for 'C++ constructor' (possibly out of memory). (sv/newFailed) [d,cba ] Exit requested with status = 1 [d,cba ] Exiting ... Internal Error: Expectation for '_txnLevel == 0' failed (~DuDatabase - Unmatched transaction level in file ../DuDatabase.C, line 130). (cu/cuAssert) [d,du ] Disconnecting from db: NHTD, user: neth, handle: [0xf78c8] ... [d,du ] Disconnected. 9. Dialog Rollup finished 12:23PM. 10. File was ftp'd to your incoming pwcnhtd.tar.gz 11. Node Count was 783719 via nhDbStatus - The sql script failed to work with the oracle database for node pairs. 12. The advanced log files were ftp'd to your incoming as pwclog.tar.gz No core files generated (See attached file: cap4week.pdf)(See attached file: cap4week.txt)(See attached file: cap4today.txt)(See attached file: cap4today.pdf) "Fintonis, Melissa" on 02/04/2002 08:35:35 AM To: Donald D Mount/US/GTS/PwC cc: Betaprogram Subject: RE: 5.5 B2 DB Bug (20576)PWC Hi Don, Please confirm that nhiDialogRollup in Beta 4 fixes this problem. thanks! Melissa -----Original Message----- From: Fintonis, Melissa Sent: Wednesday, January 23, 2002 1:32 PM To: 'donald.d.mount@us.pwcglobal.com' Cc: Betaprogram Subject: 5.5 B2 DB Bug (20576)PWC Since the bug was split this only pertains to that part (20576). 
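Don's request above for a working node-pair query reflects that the Ingres-style `sql` wrapper no longer returns output against the 5.5 Oracle database; the natural replacement is Oracle's sqlplus. The sketch below is hedged: the eHealth schema login was never supplied in this thread, so `DB_USER`/`DB_PASS`/`ORACLE_SID` are placeholders, and only the query text itself comes from the original step.

```shell
#!/bin/sh
# Hedged sketch: the node-pair count rephrased for Oracle sqlplus.
# DB_USER/DB_PASS/ORACLE_SID are placeholders -- the real eHealth schema
# credentials were never supplied in this thread.
node_pair_query() {
    cat <<'EOF'
set heading off feedback off
select count(*) from nh_node_addr_pair;
exit
EOF
}

# On site this would be piped into sqlplus, e.g.:
#   node_pair_query | sqlplus -s "$DB_USER/$DB_PASS@$ORACLE_SID"
node_pair_query
```

The `-s` (silent) flag plus `set heading off feedback off` leaves just the bare count on stdout, which is what the collection step needs.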
- just trying to keep the tickets separate - an FYI email noting the number with the feedback -----Original Message----- From: donald.d.mount@us.pwcglobal.com [mailto:donald.d.mount@us.pwcglobal.com] Sent: Wednesday, January 23, 2002 11:16 AM To: MFintonis@concord.com Cc: Betaprogram Subject: Re: BUG #(19971) PWC - MORE INFO The files have been ftp'd to your incoming directory at ftp.concord.com: pwcdbg.tar.gz nhtd.tar.gz Here is the set of things we'd like you to collect at your earliest convenience. It looks lengthy, but it's just a series of edits and commands interspersed with periods of waiting. 1. Copy your current $NH_HOME/sys/debugLog.cfg to debugLog.orig. Save the attached debugLog.cfg to overwrite your current file. This file controls the debug tracing settings that programs will use when advanced logging is enabled. There is an "arguments" line for each program. I've set up what we want to collect for this test. 2. Delete or move everything out of $NH_HOME/log/advanced. You may already be running advanced logging there for Eric or someone else. If not, just delete the files in this directory. 3. CD to $ORACLE_HOME/admin/udump/$ORACLE_SID and delete all the *.trc files. You'll probably need to be logged into the oracle user account to delete these *.trc files. 4. Make note of the time! From the console use "Setup->Advanced Logging..." to enable tracing for "Conversations Poller", "Console" and "Database Rollup". (5:55AM) 5. After a couple of conversations polls you can go back into "Setup->Advanced Logging..." and disable both Console tracing and Conversations poller. 6. Make note of the time! Use the Motif Console to try to run the protocol reports you are having trouble with so that we can capture tracing on this. (7:29AM) Report Failed 7. Make note of the time! From the command line run nhiNameNodes -Dall > $NH_HOME/log/advanced/nhiNameNodes_dbg.txt (7:52AM) Name Nodes exceeded /tmp disk space and failed 8. 
Wait for Conversations Rollup to run if it hasn't since you enabled Advanced Logging. 9. Go back into "Setup->Advanced Logging..." and disable "Conversations 1/23/2002 10:53:00 AM Betaprogram CCRD SUPPORT: Farrell O'Connor foconnor@concord.com 508-303-4349 Installed Oracle successfully. Installed ehealth 5.5 beta3 successfully. typed nethealth. console came up fine. poller initialization. pies went from orange to red. server stopped unexpectedly. Wednesday, January 23, 2002 08:22:26 AM Pgm nhiPoller[Live]: Poller initialization complete (Fast Live Poller). Wednesday, January 23, 2002 08:27:27 AM System Event nhiServer The server has stopped. Wednesday, January 23, 2002 09:24:38 AM System Event nhiCfgServer Server started successfully. Wednesday, January 23, 2002 09:24:44 AM System Event nhiConsole Console initialization complete. Wednesday, January 23, 2002 09:25:15 AM Error nhiDbServer Pgm nhiDbServer: Database error: (ORA-01401: inserted value too large for column ). Wednesday, January 23, 2002 09:24:48 AM Pgm nhiArControl: Controller has started. Product version is 5.5.0.0.1079.. Wednesday, January 23, 2002 09:25:19 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for `elemPtr` failed, exiting (in file ./CfgServer.C, line 2793). Wednesday, January 23, 2002 09:25:57 AM System Event nhiCfgServer Server started successfully. Wednesday, January 23, 2002 09:26:03 AM System Event nhiConsole Initializing the console with the server on `kansas` . . . Wednesday, January 23, 2002 09:26:04 AM System Event nhiConsole Console initialization complete. Wednesday, January 23, 2002 09:26:06 AM Pgm nhiArControl: Controller has started. Product version is 5.5.0.0.1079.. Wednesday, January 23, 2002 09:26:38 AM Error nhiDbServer Pgm nhiDbServer: Database error: (ORA-01401: inserted value too large for column ). Wednesday, January 23, 2002 09:26:39 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for `elemPtr` failed, exiting (in file ./CfgServer.C, line 2793). 
1/23/2002 11:24:59 AM Betaprogram Dear beta group, I overwrote my poller.cfg with poller.init and then rediscovered my elements using the old poller.cfg as a seed file. The server has now started. 1/23/2002 12:20:59 PM rpattabhi Problem resolved with newer poller config file. Closing bug. -Ravi 1/28/2002 2:46:06 PM Betaprogram This is actually a Nobug due to user error 1/24/2002 8:29:13 PM rrick Problem: iimerge taking up 50% cpu when executing nhSaveDb, nhCollectCustData, etc. Symptoms: Customer had a blackout last week. System came down and they brought it back up. They were able to poll and run some reports, but could not execute any db programs; they hung. Additional info: They have a good save prior to the blackout, but do not want to restore it because they will lose all their data from the last 6 days or so. Also noticed when we run verifydb it hangs for an hour. sysmod fails with duplicates in the iiattribute table. All files on bafs/58000/58942 1/25/2002 8:01:35 AM mwickham -----Original Message----- From: Rick, Russell Sent: Thursday, January 24, 2002 08:25 PM To: Chapman, Sheldon; Wickham, Mark; Recchion, David Subject: 58942 - ESCALATE Customer is in New Zealand. ProServ has a very big deal on this customer in the pipeline. They are 18 hours ahead of our time. It will be Saturday and the customer is coming into work on this issue. I must call them at 3pm our time to talk with them. 1/25/2002 10:19:26 AM yzhang Russell: By looking at the information you posted on the call ticket, this customer's problem is that their database had been corrupted due to a power outage on 1/18/02. The sysmod and verifydb output did indicate this is a problem. 
here is what we need to do: a) We will try to recover the database first; to do this I need the following: 1) infodb nethealth > infodb.out (this will tell me if they have a checkpoint save) 2) errlog.log b) If we cannot recover, we can dump all of the last six days' data, but I need to know what data they expect to keep (stats0 data, element data?) c) The very last thing we can do is recycle the database. The important thing is that they must keep the most recent saved database in a safe place. 1/25/2002 3:43:39 PM rrick Spoke with Chris: - Received info - System polling ok - iimerge at 50% CPU - He took the system down and then brought it back up again.....and now iimerge at 1% CPU Sent info to Yulun. 1/25/2002 3:53:41 PM yzhang They have no checkpoint and no journal, so there is no way for them to run forwarddb to recover the database. What did you tell ProServ this afternoon? Here is what happened at the time of the power outage: U0023002::[32830 , 00000001]: Thu Jan 17 23:17:52 2002 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (su4.us5/00) Server -- Normal Startup. U0023002::[32830 , 00000017]: Thu Jan 17 23:20:05 2002 E_US0014 Database not available at this time. o The database may be marked inoperable. This can occur if CREATEDB failed. o An exclusive database lock may be held by another session. o The database may be open by an exclusive (/SOLE) DBMS server. U0023002::[32830 , 00000017]: Thu Jan 17 23:20:05 2002 E_SC0123_SESSION_INITIATE Error initiating 1/25/2002 4:36:31 PM rrick -----Original Message----- From: Zhang, Yulun Sent: Friday, January 25, 2002 3:44 PM To: Rick, Russell Subject: RE: Call Ticket #58942 & Problem Ticket #20616 attn Russ Rick They have no checkpoint and no journal, so there is no way for them to run forwarddb to recover the database. 
What did you tell ProServ this afternoon? Here is what happened at the time of the power outage: U0023002::[32830 , 00000001]: Thu Jan 17 23:17:52 2002 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (su4.us5/00) Server -- Normal Startup. U0023002::[32830 , 00000017]: Thu Jan 17 23:20:05 2002 E_US0014 Database not available at this time. o The database may be marked inoperable. This can occur if CREATEDB failed. o An exclusive database lock may be held by another session. o The database may be open by an exclusive (/SOLE) DBMS server. U0023002::[32830 , 00000017]: Thu Jan 17 23:20:05 2002 E_SC0123_SESSION_INITIATE Error initiating Spoke with Dave Romans: - Deal is with a report for this customer - See Bob Keville for instructions on the split merge process to recover the db. Spoke with Bob Keville: - Is this client a good candidate for the ProServ split merge process? - Need to recover the db....at least a partial save. -----Original Message----- From: Rick, Russell Sent: Friday, January 25, 2002 4:20 PM To: Zhang, Yulun Subject: RE: Call Ticket #58942 & Problem Ticket #20616 attn Russ Rick Is there any way to pull out the data directly from the db? I talked with ProServ and Bob Keville and they said we must recover the db before we can do a split merge. Can we recover any data at all? I am going to see if the customer can sql into the db and see if the polling he is doing is actually getting inserted into the db. - Russ Spoke with Chris: - Can sql nethealth - Can sql iidbdbd - Trend report on polling data to see if the data is getting inserted into the db was successful. 1/28/2002 10:11:17 AM yzhang Have asked support to work with the customer on recovering the db through ascii unload and reload. 1/28/2002 3:53:06 PM mwickham -----Original Message----- From: Wickham, Mark Sent: Monday, January 28, 2002 03:43 PM To: Zhang, Yulun Cc: Rick, Russell Subject: Problem Ticket 20616 Yulun, The customer attempted to run the nhSaveDb command, but killed it after 2-1/2 hours. 
He did send in some files which are located on BAFS in \escalated tickets\58000\58942\28Jan01. Can you explain in detail how we ask the customer to recover the db through ascii unload and reload? Thanks - Mark 1/29/2002 10:29:13 AM yzhang Waiting on the customer for the ascii db unloading. 1/30/2002 1:20:01 PM yzhang How is the unloading and reloading going? 1/30/2002 1:57:55 PM yzhang Russell and Bob: OK, please find out what new data they want to load into the existing db save. Check with the customer on this, and have the customer keep the most recent dbsave in a safe place. This ticket should be de-escalated and passed to ProServ as Bob suggested before. Yulun 2/1/2002 11:16:54 AM yzhang db loading is fine, ticket closed, refer to ProServ for any additional work 1/25/2002 2:23:03 PM Betaprogram SEAGATE: nhiPollSave took 4 days to complete. Log attached in public folder. CONTACT INFO: Joe Madi: AE doing testing: cell (650) 224-7470 Customer contact: Kelvin Cheah (831) 4239-7661 1/28/2002 2:41:11 PM rhawkes This is very similar to a hang that was seen at CSC. There, the cause was corruption in the Ingres database, which running the SYSMOD procedure corrected. Seagate is currently trying this out. 1/30/2002 10:24:21 AM Betaprogram I just talked to Joe Madi at SEAGATE; apparently after he executed sysmod, everything started running in proper time. I think we probably should add this to the migration procedure. I will check on that and let you know. saeed 1/30/2002 4:10:48 PM rhawkes Saeed, I entered another ticket for Seagate's current problem. Since this one is resolved, please close it. Thanks. 2/1/2002 8:55:13 AM shonaryar Patty added a section to the DOC about running sysmod if this happens. saeed 3/4/2002 12:10:27 PM Betaprogram Customer Verified this is FIXED IN BETA 4 1/28/2002 11:11:46 AM foconnor Ascii save fails on 5.0.2. Ran a nhSaveDb with the "-ascii" option and it has failed. 
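Yulun's "ascii unload and reload" above is the standard Ingres recovery cycle: generate copy scripts with unloaddb, dump everything to ascii, rebuild a clean database, and reload. The sketch below is a hedged dry-run (it only prints the commands); unloaddb, destroydb, and createdb are standard Ingres utilities, the `nethealth` database name comes from the ticket, and the working directory is a made-up placeholder.

```shell
#!/bin/sh
# Dry-run sketch of an Ingres ascii unload/reload cycle, as discussed in
# the thread above. Commands are printed, not executed; /some/empty/dir
# is a placeholder for a scratch directory with enough disk space.
unload_reload_cmds() {
    # unloaddb generates copy.out/copy.in scripts for every table
    echo "cd /some/empty/dir && unloaddb nethealth"
    echo "sql nethealth < copy.out      # dump all tables to ascii files"
    # rebuild a clean database, then reload the ascii dump
    echo "destroydb nethealth && createdb nethealth"
    echo "sql nethealth < copy.in       # reload the ascii files"
}

unload_reload_cmds
```

In practice this would be run as the database owner, after a verified dbsave is copied somewhere safe, since destroydb is irreversible.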
I have a customer that ran this on 4.7.1 and it failed on reading sample data (they have stats1 and stats2 tables saved but not stats0 tables). Scheduled normal save takes about 9 minutes. Command line ascii save went ~2 hours and 25 minutes and failed. F:\eHealth>nhSaveDb -p f:/ehealth/db/save/ascii.tdb -ascii ehealth See log file F:/eHealth/log/save.log for details... Begin processing 1/28/2002 07:58:47. Copying relevant files (1/28/2002 07:58:50). Fatal Internal Error: Sql Error occured during operation (E_QE009C Unexpected error received from another facility. Check the server error log. (Mon Jan 28 10:23:09 2002) ). (du/dbGetColumnString) ================================================================================ Save.log Unloading the data into the files, in directory: 'F:/ehealth/db/save/ascii.tdb/'. . . Unloading table nh_alarm_attribute . . . Unloading table nh_active_alarm_history . . . Unloading table nh_active_event_history . . . Unloading table nh_active_event_data . . . Unloading table nh_active_exc_history . . . ***cut out for brevity***** . . . Unloading table nh_node . . . Unloading table nh_notifier . . . Unloading table nh_elem_type_used . . . Unloading the sample data . . . Monday, 1/28/2002 10:33:42 Msg received from import module 'netflow : Error: Unable to connect to database 'ehealth' (E_LQ0001 Failed to connect to DBMS session. E_LC0001 GCA protocol service (GCA_REQUEST) failure. Internal service status E_GC0139 -- No DBMS servers (for the specified database) are running in the target installation.. ). '. ================================================================================= Excerpt from the system log: Monday, 1/28/2002 10:33:46 Internal Error (Database Server) Database error: . (cdb/cdbJobMoveToRun) Monday, 1/28/2002 10:33:46 Starting job 'Discover' . . . (Job id: 1000017, Process id: 2624). Monday, 1/28/2002 10:33:46 Internal Error (Database Server) Database error: . 
(cdb/cdbJobAddRunRow) Monday, 1/28/2002 10:35:08 Error (Statistics Poller) Invalid process set specified for 'PC-DRECCHION-SH-NetHealth-nhiServer'. Monday, 1/28/2002 10:35:08 Error (Statistics Poller) Invalid process set specified for 'PC-FOCONNOR2K-SH-NetHealth-nhiServer'. Monday, 1/28/2002 10:35:08 Warning (Statistics Poller) Sql Error occured during operation (E_LQ002D Association to the dbms has failed. This session should be disconnected. ). Monday, 1/28/2002 10:35:09 Error (LiveExceptions Server) Sql Error occured during operation. Monday, 1/28/2002 10:35:09 Error (LiveExceptions Server) Sql Error occured during operation. Monday, 1/28/2002 10:35:09 Error (LiveExceptions Server) Unable to execute 'set lockmode on nh_alarm_history where level=table'. Monday, 1/28/2002 10:35:09 Error (Statistics Poller) Unable to execute 'update nh_schema_version set minor_rev = minor_rev'. Monday, 1/28/2002 10:35:09 Error (Statistics Poller) Unable to add 'network element' data to the database, dropping this poll. Monday, 1/28/2002 10:36:44 Error (Conversations Poller) Failed to commit database transaction (E_LQ002D Association to the dbms has failed. This session should be disconnected. E_LC0031 Protocol write failure. Association with database partner failed (GCA_SEND) with status E_GC0001 -- Association failure: partner abruptly released association. ). ========================================================================================= 1/28/2002 11:13:37 AM foconnor IN HOUSE: Win 2000: F:\eHealth\db\save\ascii.tdb>nhShowRev eHealth version: 5.0.2 D01 - Patch Level: 1 1/29/2002 12:10:41 PM yzhang Can you send errlog.log and the full save.log? Is this a problem from in-house? 1/29/2002 1:00:42 PM foconnor Sent Yulun save.log and errlog.log 1/30/2002 6:51:28 AM foconnor A similar problem occurred at Reseller Edge On on two eHealth 5.0.2 servers. The ascii save failed at the point where the stats1 and 2 tables get unloaded but the stats0 tables do not. 
1/30/2002 9:38:48 AM tbailey Note: associated ticket 59210 for Fortis. Fortis is running 4.7.1, but the problem is the same. We can't get debug from Fortis because the ascii save runs for 24 hours and hangs. We may be able to get an ftp or tape of their regular dbsave if you think it will help. 1/30/2002 3:16:47 PM yzhang Please de-escalate 20678; most likely the ascii save will be ok. The other reason for the de-escalation is that this is an in-house problem. Thanks Yulun 1/31/2002 7:15:23 AM foconnor Ran the ascii save again on Jan 30 and this is what is in the console (not written to the database): Wednesday, 1/30/2002 14:44:17 Error (Statistics Poller) Invalid process set specified for 'PC-FOCONNOR2K-SH-NetHealth-nhiServer'. Wednesday, 1/30/2002 15:20:25 Internal Error (Database Server) Database error: (E_LQ002D Association to the dbms has failed. This session should be disconnected. E_LC0031 Protocol write failure. Association with database partner failed (GCA_SEND) with status E_GC0001 -- Association failure: partner abruptly released association. ). (cdb/cdbJobMoveToRun) ...... Thursday, 1/31/2002 05:20:33 Internal Error (Database Server) Database error: . (cdb/cdbJobUpdateRunRow) Thursday, 1/31/2002 06:00:25 Internal Error (Database Server) Database error: . (cdb/cdbJobMoveToRun) Save log: Unloading table nh_notifier . . . Unloading table nh_elem_type_used . . . Unloading the sample data . . . -----> stopped here Win 2000 Event Viewer and message on the command line output: nhiSaveDb.EXE: Fatal Internal Error: Ok. (none/) 2/8/2002 10:01:30 AM ebetsold This problem has manifested at a customer site. They are trying to migrate a DB from a POC machine to Production. This ticket is in MoreInfo; what information do you require? 2/14/2002 10:10:38 AM ebetsold This issue has been sitting idle for a few days now. I found this ticket in "more info"; what information do you need? 2/19/2002 8:39:58 AM ebetsold This issue is marked as Critical. 
Please give me a status on this issue. 2/25/2002 11:02:12 AM rtrei Will merge the fix up from 4.7.1 after it is verified. 2/26/2002 11:31:19 AM dbrooks Repeat of ticket number 20773. 3/6/2002 10:04:50 AM ebetsold This is not a duplicate of 20773. 20773 is a problem with a Top N report (symbol undefined for Stats (respPath) avgLineUtilization.) Robin, you stated that you will merge the fix up from 4.7.1; when will this occur? 3/7/2002 11:29:21 AM rtrei Assigning to Yulun for merge up from the 4.7.1 stream. Change was only to /vobs/top/frameworks/cdf*/duLib/DuTable.C Just checked in the view to the minnow_p08 stream. This must be merged up to the latest catfish branch and then up to the piranha patch 3 branch. Because this is a change to the libraries, and this is an NT customer, you will need to work with CM about how to get a one-off of the CciFwDbIng library. (I don't think you can build it in debug mode, so I suspect they will need to build this.) Tech Support would like this one-off by early afternoon. 3/8/2002 11:12:10 AM yzhang Here is the new build; please have the customer replace CciFwDb.dll under $NH_HOME/lib, and replace nhiDbSave.exe under $NH_HOME/bin/sys. A backup should be made prior to the replacement. 3/11/2002 10:46:42 AM ebetsold db save appears to have completed successfully. I will update after the dbrestore on the Unix machine. 3/11/2002 3:46:40 PM ebetsold Per Elliot Slaon on site: Patch for 5.02 works like a charm. 3/15/2002 9:04:43 AM rsanginario Marking this as fixed. Passed Tribunal. 5/17/2002 4:05:03 PM rsanginario (per yulun) this is just a merge ticket. 1/28/2002 1:51:00 PM jpoblete Customer Factset. Customer noticed the stats poller was not writing data to the DB; also, he was getting the following errors while trying to run reports: E_DM004 Lock quota exceeded. Customer gets the following errors in the errlog.log E_CL1030_LK_EXPAND_LIST_FAILED LK failed while trying to allocate more locks lock lists. There are 0 locks lists, allocated at system startup time, of which 0 are used. 
E_DMA011_LK_NO_LLBS No more lock list blocks are available. It could be that locking system shared memory has been exhausted, or that the configured limit (ii.*.rcp.lock.list_limit) has been exceeded. E_CL1005_LK_NOLOCKS Out of lock resources Rebooting the server cleared the lock problem, but it re-occurred within one hour. Rebooted again today (01/28/2002); it seems to be polling fine up to now. Collected Ingres errlog.log and config.dat Spoke to Yulun Zhang; he agreed to open a problem ticket for this issue. 1/28/2002 4:59:42 PM jpoblete Customer also gets the following messages ... -----Original Message----- From: PPatel@factset.com [mailto:PPatel@factset.com] Sent: Monday, January 28, 2002 4:31 PM To: support@concord.com Cc: spochay@concord.com; plamachia@concord.com; jyoung@factset.com Subject: Console crashes when I try to discover elements When I launched a discovery, the console crashed. Server: win2K Sp2 NH: 5.0.2 Patch Level 1 Contract ID: 002085 Server Name: Concord Physical Address. . . . . . . . . : 00-02-A5-07-0A-29 Event Viewer Logs: Application popup: show.exe - Unable To Locate DLL : The dynamic link library NuTC.dll could not be found in the specified path E:\eHealth\bin\sys;.;C:\WINNT\System32;C:\WINNT\system;C:\WINNT;E:\eHealth\bin\sys;E:\eHealth\bin;E:\eHealth\oping\ingres\utility;E:\eHealth\oping\ingres\bin;E:\eHealth\nutcroot\bin;E:\eHealth\bin\mksnt;C:\WINNT\system32;C:\WINNT;C:\WINNT\System32\Wbem;C:\WINNT\PROGRA~1\VISION;C:\WINNT\PROGRA~1\VISION\System;C:\WINNT\PROGRA~1\COMMON~1\VISION. Application popup: eHealth : nhiIndexStats.exe: Internal Error: Unable to connect to database 'ehealth' (E_DM004B Lock quota exceeded. (Mon Jan 28 16:20:23 2002) ). (du/DuDatabase::dbConnect) System messages on Console: Application popup: eHealth : nhiIndexStats.exe: Internal Error: Unable to connect to database 'ehealth' (E_DM004B Lock quota exceeded. (Mon Jan 28 16:20:23 2002) ). 
(du/DuDatabase::dbConnect) See NT Event Log for more details Application popup: eHealth : nhiIndexStats.exe: Internal Error: Unable to connect to database 'ehealth' (E_DM004B Lock quota exceeded. (Mon Jan 28 16:20:23 2002) ). (du/DuDatabase::dbConnect) 1/28/2002 5:49:18 PM yzhang Requested support to ship 502P02 to the customer, which includes a fix for the logical lock quota exceeded error. 1/29/2002 6:42:45 PM yzhang Keep a closer watch on any problems associated with nhiDbServer and the CciWscDb.dll library. 1/30/2002 1:17:52 PM yzhang Can you check with the customer to see if they see the same error after upgrading to 502p02? 1/31/2002 4:09:34 PM yzhang This is the same problem as 19342; the fix for 19342 worked for this customer. 1/29/2002 7:50:41 AM mmcnally nhFetch shows the following error message: Error: Unexpected database error nhSaveDb shows the following error message: Unload the dac tables. . . Unloading table nh_daily_exceptions_1000001 .... Fatal Internal Error: Unable to execute 'COPY TABLE nh_daily_exceptions_1000001 () INTO '/export/home/nethealth/db/save.tdb/nh_daily_exceptions_1000001'' (E_US0845 Table 'nh_daily_exceptions_1000001' does not exist or is not owned by you. (Wed Jan 2 03:08:32 2002) ). (cdb/DuTable::saveTable) Customer is running nethealth 4.8 P07 D06 on Solaris 2.8. All related files are on BAFS/58000/58487 2/1/2002 12:00:22 PM yzhang Colin, there are two problems with this customer: dbsave fails and fetch fails. For taking care of the dbsave failure, please get the following: echo "select table_name,num_rows from iitables where table_name like '%daily%' and table_name not like '%ix%' order by table_name\g" | sql $NH_RDBMS_NAME > daily_table.out echo "select table_name,num_rows from iitables where table_name like '%hourly%' and table_name not like '%ix%' order by table_name\g" | sql $NH_RDBMS_NAME > hourly_table.out echo "select * from nh_rpt_config\g" | sql $NH_RDBMS_NAME > rpt_config.out For taking care of the fetch, 
please grab everything on ~yzhang/remedy/48p9_distributed_poll_sol and replace the same things on each of the remote and central machines (be sure to back up the originals). Do the replacement after remotesave and fetch finish, and cc me on the email you send to the customer. Thanks Yulun 2/4/2002 11:23:39 AM mmcnally requested more info. 2/5/2002 7:57:55 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Tuesday, February 05, 2002 7:47 AM To: Zhang, Yulun Subject: PT 20695 Database save fails with error. Hi Yulun, The requested files were collected for this customer on winnt. The reseller is on site today until 1:00 if we need anything else. The files are attached and on BAFS/59000/59065/problemTicket Thanks, Mike 2/5/2002 3:14:34 PM yzhang for 59065: sql nethealth; drop table nh_daily_exceptions_1000001. If this does not work, use verifydb to drop the table (find the command in db_work_sheet if you don't know it). Then run the script from /export/sulfur3/nh48_s_m/prob_20695.sh, just typing the script name, and get concord.out after running the script. concord.out is located under the directory where you run the script. Note: script prob_20695.sh can only be run after the table nh_daily_exceptions_1000001 has been successfully dropped. 2/6/2002 7:31:27 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Wednesday, February 06, 2002 7:21 AM To: Zhang, Yulun Subject: PT 20695 Database save fails. Yulun, Attached is customer 59065 concord.out. It looks like the table still exists. I had them do the following before running the script: We need to drop the following table from the database: nh_daily_exceptions_1000001 1) Please type the following command from the $NH_HOME directory: sql nethealth 2) A * prompt will appear. Then type: drop table nh_daily_exceptions_1000001 If this returns you to the * prompt, it was successful. Press Ctrl-C to exit the * prompt and proceed to step 4. 
3) If the sql command was not successful, run the verifydb command from the $NH_HOME directory with the following syntax: verifydb -mrun -sdbname -odrop_table nh_daily_exceptions_1000001 Substitute dbname for the actual name of the database. Default is nethealth. 4) Now place the attached script in $NH_HOME/temp and execute it by typing prob_20695.sh. This will create a .out file named concord.out in the same directory it is executed from. Please send this to support for review. 2/6/2002 9:36:29 AM yzhang can you do verifydb -mreport -sdbname $NH_RDBMS_NAME -otable nh_daily_exceptions_1000001, then send iivdb.log and errlog 2/6/2002 9:47:59 AM mmcnally Requested more info. 2/6/2002 10:48:00 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Wednesday, February 06, 2002 10:38 AM To: Zhang, Yulun Subject: PT 20695 call ticket 59065 "database save is not working" Yulun, The customer's response is below. He never sent in the files. He has successfully saved his database. Let me know if this is ok to close. We are still awaiting info from call ticket 58487 to see if the fetch worked. Thanks, Mike 2/7/2002 11:37:35 AM mmcnally -----Original Message----- From: NETHEALTH [mailto:NETHEALTH@fth2.siemens.de] Sent: Thursday, February 07, 2002 11:04 AM To: 'McNally, Mike' Subject: AW: 58487 "nhFetch and nhSaveDb fail with error" Hello Mike, we replaced the files as you advised. After replacing the files the fetch had no error messages. We will have a look at the fetch over the next days and give you more feedback. Attached you have the output of sql-statements. Regards Hermann Loscher Siemens Business Services GmbH & Co. OHG 2/14/2002 2:20:57 PM mmcnally all set. Fix worked. 2/14/2002 2:32:10 PM yzhang problem solved 1/30/2002 10:34:42 AM foconnor Customer has a large database; they attempted to perform an ascii save, but the save failed after about 26 hours. 
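The drop-then-verify procedure in steps 1-4 above could be collected into a single sketch. The variable names here are mine; the `sql` and `verifydb` invocations are the ones quoted in the ticket and need a live eHealth/Ingres installation, so they are left commented out.

```shell
# Sketch of the drop procedure above. Assumes the eHealth environment;
# the database defaults to nethealth, as the ticket notes.
db=${NH_RDBMS_NAME:-nethealth}
tbl=nh_daily_exceptions_1000001

# Step 2: the SQL statement to feed to the * prompt of "sql $db".
drop_sql="drop table ${tbl}\\g"

# Steps 2-3 (need a live Ingres; verifydb syntax is the one quoted above,
# with dbname substituted):
# printf '%s\n' "$drop_sql" | sql "$db" || \
#     verifydb -mrun "-s${db}" -odrop_table "$tbl"
```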
Customer is going to load the database onto a Solaris server running 5.0.2. Current system: Network Health 4.7.1 on Windows NT. Database is 8 GB. Excerpt from the save.log: Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . Unloading table nh_var_units . . . Unloading the sample data . . . Fatal Internal Error: Ok. (none/) File listing of the save is in: BAFS/escalated tickets/59000/59210 I have requested the last good binary save. Reseller will ask the customer; customer is Fortis Bank. See also Problem ticket 20678 1/30/2002 3:28:08 PM yzhang Farrell, As we talked, please do the following. 1) get the customer dbsave (from regular save), have them place it in a safe place 2) have them place the incomplete ascii save in the safe place too 3) then do the following: a) stop nhServer (through service) b) stop ingres (through service) c) start ingres (through service) d) check to make sure four ingres processes are running e) run ascii save f) send me errlog.log and save.log (I need the complete errlog.log) you can send this email (cc to me) to the customer today if you have time, so we can talk to them tomorrow morning. Thanks Yulun 1/31/2002 1:00:48 PM foconnor Customer is sending their database to our ftp site. 1/31/2002 2:58:31 PM yzhang Farrell is in the process of loading the customer db on an NT system, and will then try to do the ascii save. At the same time, we sent some commands to the customer for doing the ascii save. 2/4/2002 7:44:44 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Monday, February 04, 2002 7:34 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 20733: Call ticket 59210 Yulun, I have received the database from the customer. //BAFS/escalated tickets/59000/59210/db/59210_db.tar Also on our ftp site: ftp.concord.com/incoming. 2/4/2002 9:36:02 AM yzhang please load the customer database on the same platform and ehealth version as the customer. Let me know after you have done the load. 
please load it as soon as possible. Thanks Yulun 2/5/2002 10:06:26 AM yzhang I think I forgot -ascii option, the correct one should be: D:\nh5.0\db\save>nhiSaveDb -Dall -p my_test_asc.tdb -d nethealth50 -ascii How is this command work? did you check with customer regarding the database problem we talked in you office yesterday. the db save they sent is out of date, is that the db they want to convert from? 2/6/2002 10:20:29 AM foconnor The database save failed on my machine. There is no save.log for this save (this is the only time I have not seen the save.log) It looks like the ascii save fails on the first nh_stats0 table. Maybe nhiNtUtil is the culprit it exits with a "1". From the screen: Unloading table nh_deleted_elemen Unloading table nh_var_units . . Unloading the sample data . . . Fatal Internal Error: Ok. (none/) Debug output: [z,cu ] returning env var = 'NH_DB_ASCII_DELIM' for type = 152 [z,cu ] returning env var val = '' for type = 152 [d,cu ] returning dflt str val = '|' for type = 152 [z,cu ] returning env var = 'NH_DB_ASCII_EOL' for type = 153 [z,cu ] returning env var val = '' for type = 153 [d,cu ] returning dflt str val = 'null' for type = 153 [d,du ] End transaction level 2 [d,du ] Executing SQL cmd 'COPY TABLE nh_dlg0_994139999 (sample_time=CHAR(0)'|',nap_id=CHAR(0)'|',proto_id=CHAR(0)'|',dlg_src_id=CHAR(0)' null) INTO 'my_test_asc1.tdb/nh_dlg0_994139999.ascii'' ... [d,du ] DuDatabase (execSql): errorOnNoRows: No [Z,du ] (dbExecSql): errorOnNoRows: No [Z,du ] (dbExecSql): sqlCmd: COPY TABLE nh_dlg0_994139999 (sample_time=CHAR(0)'|',nap_id=CHAR(0)'|',proto_id=CHAR(0)'|',dlg_src_id=CHAR(0 0)null) INTO 'my_test_asc1.tdb/nh_dlg0_994139999.ascii' [Z,du ] (dbExecSql): sqlca.sqlcode: 0 [Z,du ] (dbExecSql): rows: 99817 [Z,du ] returning DuScNormal [d,du ] Cmd complete, SQL code = 0 [d,du ] Saved table successfully. [d,du ] Committing database transaction ... [d,du ] Committed. 
[d,du ] End transaction level 1 [z,cu ] returning env var = 'NH_BIN_SYS_DIR' for type = 4 [z,cu ] returning env var val = '' for type = 4 [z,cu ] returning env var = 'NH_BIN_DIR' for type = 2 [z,cu ] returning env var val = '' for type = 2 [z,cu ] returning env var = 'NH_HOME' for type = 1 [z,cu ] returning env var val = 'D:/nethealth' for type = 1 [z,cu ] returning env var = 'NH_DBLOC_STS_RAW' for type = 1135 [z,cu ] returning env var val = '' for type = 1135 [z,du ] Saving table nh_stats0_994136399 to file my_test_asc1.tdb/nh_stats0_994136399.ascii ... [d,du ] Begin transaction level 1 [d,du ] Begin transaction level 2 [Z,du ] sqlca.sqlcode: 100 [Z,du ] rows: 0 [d,du ] End transaction level 2 [d,du ] Rolling back database transaction. [d,du ] End transaction level 1 [z,cu ] returning env var = 'NH_BIN_SYS_DIR' for type = 4 [z,cu ] returning env var val = '' for type = 4 [z,cu ] returning env var = 'NH_BIN_DIR' for type = 2 [z,cu ] returning env var val = '' for type = 2 [z,cu ] returning env var = 'NH_HOME' for type = 1 [z,cu ] returning env var val = 'D:/nethealth' for type = 1 [d,cu ] returning programName = 'D:/nethealth/bin/sys/nhiNtUtil' for pgmId = 20 [d,cba ] Exit requested with status = 1 [d,cba ] Exiting ... 2/8/2002 10:13:50 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Friday, February 08, 2002 10:04 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: Problem ticket 20733: Call ticket 59210 Yulun, Can I get an update on the status of problem ticket 20733? 2/11/2002 4:49:31 PM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Monday, February 11, 2002 4:40 PM To: Zhang, Yulun Cc: Trei, Robin; O'Connor, Farrell Subject: Problem ticket 20733 Importance: High Yulun, Can I get an update on the status of Problem ticket 20733? Regards, Farrell 2/19/2002 5:02:11 PM rtrei Have gotten minnnow installed. Loading database. Have questions about files in the escalated tickets directory. 
Need to talk with someone in Tech Support regarding such. 2/22/2002 10:42:05 AM mwickham Spoke with Robin on this issue. We have reproduced the issue here on an eHealth 4.7.1 installation using the customer's database. We've found where in the code we are failing. If a viable workaround is not found (dropping a specific table, for instance), a code change will need to be implemented and a one-off written. We are estimating a fix by the end of next week. 2/22/2002 3:59:24 PM rtrei I had thought this was failing because of bad data in the tables. It is actually failing when it queries to get the column names to return. I am in the process of debugging. Could not recreate on my system. The only thing I can think of is that the table might have been rolled out in the interim from the start of the save (27 hours). I am changing the code to just continue if a table cannot be found. An exec should be available on Monday. 2/25/2002 11:06:37 AM rtrei Put the exe in the esc tickets dir for Farrell to test. 2/25/2002 1:42:18 PM mwickham Performing an ascii save right now on Farrell's machine (4.7.1 on NT 4.0 using the customer's db) and using Robin's new nhiSaveDb.exe. 2/26/2002 4:47:50 PM mwickham Testing new nhiSaveDb.exe on NT 4.0 eHealth 4.7.1 installation in lab where we loaded a fresh copy of the customer's database (Stats0 tables present). 2/28/2002 8:28:29 AM mwickham ASCII save is still running, however it has saved several stats0 tables...the fix appears to be working. We're sending it to the customer to run over the weekend. 3/4/2002 10:35:02 AM foconnor Ascii save on inhouse test machine was successful on March 01 (Sent Robin email) -----Original Message----- From: O'Connor, Farrell Sent: Friday, March 01, 2002 11:44 AM To: Trei, Robin Cc: O'Connor, Farrell; Wickham, Mark Subject: Problem ticket 20733 Robin, Our test ascii save (in house) was successful (it took 3 days, 2/26/2002 - 3/1/2002) Waiting to hear from customer. Begin processing (2/26/2002 02:29:54 PM). 
Copying relevant files (2/26/2002 02:29:55 PM). Unloading the data into the files, in directory: 'E:/59210/test3.tdb/'. . . . Unload of database 'nethealth' for user 'nhuser' completed successfully. End processing (3/1/2002 11:36:31 AM). 3/4/2002 10:35:52 AM foconnor Customer will test new nhSave executable on March 6 on their test machine. 3/15/2002 11:31:11 AM foconnor Customer is still testing the fix; they have run into several non-eHealth issues on the server, which is delaying the save. 3/19/2002 7:11:56 AM foconnor Ascii save was successful; however, the load failed on 5.0.2. Trouble continues: the ascii-load fails on the production-server. Please advise ASAP! console output: ------------------ usucnh1p% nhLoadDb -p /export/home/concord/eHealth/nhDBsave -u concord ehealth See log file /opt/ehealth/log/load.log for details... Begin processing 19/03/2002 10:52:14. Cleaning out old files (19/03/2002 10:52:14). Copying relevant files (19/03/2002 10:52:14). Fatal Internal Error: Append to table nh_element failed, see the Ingres error log file for more information (E_CO0039 COPY: Error processing row 3323. Cannot convert column 'device_speed2' to tuple format. ). (cdb/DuTable::appendTable) system.log: -------------- Tuesday, March 19, 2002 10:51:27 AM System Event nhiConsole Console initialization complete. Tuesday, March 19, 2002 10:54:40 AM Fatal Internal Error nhiCfgServer Pgm nhiCfgServer: Call 'cdbFillElements' to database API failed. (dbs/DbsMsgHandler::getElementsCCb) Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Data Analysis' was missed (Job id: 100002). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000010). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000026). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000045). 
Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000047). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000048). Tuesday, March 19, 2002 10:55:28 AM Error nhiMsgServer Pgm nhiMsgServer: Unable to obtain a step definition for step type '119' (job 'Import Elements' was disabled). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000183). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000185). Tuesday, March 19, 2002 10:55:28 AM Pgm nhiMsgServer: Scheduled job 'Health' was missed (Job id: 1000186). Tuesday, March 19, 2002 10:55:29 AM System Event nhiCfgServer Server started successfully. Tuesday, March 19, 2002 10:55:32 AM System Event nhiConsole Console initialization complete. Tuesday, March 19, 2002 10:55:36 AM Host usucnh1p: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1305.. Tuesday, March 19, 2002 10:55:36 AM Error nhiNotifierSvr Pgm nhiNotifierSvr: Sql Error occured during operation (E_US0845 Table 'nh_notifier' does not exist or is not owned by you. (Tue Mar 19 04:55:36 2002) ). Tuesday, March 19, 2002 12:04:57 PM Error nhiMsgServer Pgm nhiMsgServer: Unable to obtain a step definition for step type '119' (job 'Import Elements' was disabled). Tuesday, March 19, 2002 12:04:59 PM System Event nhiCfgServer Server started successfully. Tuesday, March 19, 2002 12:05:01 PM System Event nhiConsole Console initialization complete. Tuesday, March 19, 2002 12:05:05 PM Host usucnh1p: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1305.. Tuesday, March 19, 2002 12:05:04 PM Error nhiNotifierSvr Pgm nhiNotifierSvr: Sql Error occured during operation (E_US0845 Table 'nh_notifier' does not exist or is not owned by you. (Tue Mar 19 06:05:04 2002) ). 
Tuesday, March 19, 2002 12:05:29 PM Error nhiMsgServer Pgm nhiMsgServer: Unable to obtain a step definition for step type '119' (job 'Import Elements' was disabled). Tuesday, March 19, 2002 12:05:30 PM System Event nhiCfgServer Server started successfully. Tuesday, March 19, 2002 12:05:33 PM System Event nhiConsole Console initialization complete. Tuesday, March 19, 2002 12:05:37 PM Host usucnh1p: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1305.. Tuesday, March 19, 2002 12:05:37 PM Error nhiNotifierSvr Pgm nhiNotifierSvr: Sql Error occured during operation (E_US0845 Table 'nh_notifier' does not exist or is not owned by you. (Tue Mar 19 06:05:37 2002) ). Tuesday, 19/03/2002 12:05:40 The server stopped unexpectedly or database load completed, restarting . . . Tuesday, 19/03/2002 12:05:40 System Event Initializing the console with the server on 'usucnh1p' . . . Tuesday, 19/03/2002 12:06:45 Error (Console) Unable to connect to the server (the server is not running). Tuesday, 19/03/2002 12:06:45 System Event Console initialization failed. 3/19/2002 9:40:32 AM foconnor //BAFS/escalated tickets/59000/59210/March19 load.log 5.0.2 Recovering archive database . . . Initializing the Database . . . Creating the Tables . . . Loading the data from the Binary Files, in directory: '/export/home/concord/eHealth/nhDBsave/'. . . Loading table nh_alarm_rule . . . Loading table nh_alarm_threshold . . . Loading table nh_bsln_info . . . Loading table nh_bsln . . . Loading table nh_assoc_type . . . Loading table nh_elem_latency . . . Loading table nh_element_class . . . Loading table nh_elem_outage . . . Loading table nh_elem_alias . . . Loading table nh_element_ext . . . Loading table nh_elem_analyze . . . Loading table nh_enumeration . . . Loading table nh_exc_profile_assoc . . . Loading table ex_tuning_info . . . Loading table exception_element . . . Loading table exception_text . . . Loading table nh_exc_profile . . . Loading table ex_thumbnail . . . 
Loading table nh_list . . . Loading table nh_list_item . . . Loading table hdl . . . Loading table nh_elem_latency . . . Loading table nh_col_expression . . . Loading table nh_element_type . . . Loading table nh_elem_type_enum . . . Loading table nh_elem_type_var . . . Loading table nh_variable . . . Loading table nh_mtf . . . Loading table nh_address . . . Loading table nh_node_addr_pair . . . Loading table nh_nms_defn . . . Loading table nh_elem_assoc . . . Loading table nh_job_step . . . Loading table nh_list_group . . . Loading table nh_run_schedule . . . Loading table nh_run_step . . . Loading table nh_job_schedule . . . Loading table nh_system_log . . . Loading table nh_step . . . Loading table nh_schema_version . . . Loading table nh_stats_poll_info . . . Loading table nh_import_poll_info . . . Loading table nh_protocol . . . Loading table nh_protocol_type . . . Loading table nh_rpt_config . . . Loading table nh_rlp_plan . . . Loading table nh_rlp_boundary . . . Loading table nh_stats_analysis . . . Loading table nh_schedule_outage . . . Loading table nh_element . . . 3/19/2002 9:42:31 AM foconnor System log (5.0.2 database from 4.7.1 ascii save) Tuesday, March 19, 2002 12:20:33 PM Error nhiMsgServer Pgm nhiMsgServer: Unable to obtain a step definition for step type '119' (job 'Import Elements' was disabled). Tuesday, March 19, 2002 12:20:34 PM System Event nhiCfgServer Server started successfully. Tuesday, March 19, 2002 12:20:36 PM System Event nhiConsole Console initialization complete. Tuesday, March 19, 2002 12:20:41 PM Host usucnh1p: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1305.. Tuesday, March 19, 2002 12:20:41 PM Error nhiNotifierSvr Pgm nhiNotifierSvr: Sql Error occured during operation (E_US0845 Table 'nh_notifier' does not exist or is not owned by you. 
(Tue Mar 19 06:20:41 2002) 3/21/2002 5:04:03 PM tbailey the ASCII load is now failing 3/21/2002 5:05:23 PM tbailey See log files in BAFS/59210/Mar 19 3/25/2002 6:39:18 AM foconnor usucnh1p% nhConvertDb ehealth Loading the Dac tables . . . usucnh1p% nhLoadDb -p /export/home/concord/eHealth/nhDBsave -u concord ehealth See log file /opt/ehealth/log/load.log for details... Begin processing 25/03/2002 10:07:39. Cleaning out old files (25/03/2002 10:07:39). Copying relevant files (25/03/2002 10:07:39). Fatal Internal Error: Append to table nh_element failed, see the Ingres error log file for more information (E_CO0039 COPY: Error processing row 3323. Cannot convert column 'device_speed2' to tuple format. ). (cdb/DuTable::appendTable) 3/27/2002 3:35:24 PM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Wednesday, March 27, 2002 3:24 PM To: Trei, Robin Cc: O'Connor, Farrell Subject: Problem ticket 20733 Robin, Anything new on this one? Anything you need? 3/29/2002 10:10:51 AM foconnor -----Original Message----- From: Trei, Robin Sent: Wednesday, March 27, 2002 3:26 PM To: O'Connor, Farrell Subject: RE: Problem ticket 20733 time would be nice :> This is in the queue. Until someone issues me more hours in the day to go with the increased workload, I can only do so much. 4/2/2002 11:09:03 AM hbui Since the loading process failed partway through loading nh_element_core, there might be some bogus data in nh_element_core that screwed up the process. Asked Farrell to send me the ascii saved database that they (customer) used for loading. Meanwhile, I will find some 4.7.1 ascii database to load into 5.0.2 to verify the loading process. 4/4/2002 10:46:21 AM dbrooks changed to more info per escalated ticket meeting 4/4. 
4/4/2002 1:24:03 PM foconnor Received database on our ftp site: ftp.concord.com/incoming/nhDBsaveCC59210.rar Also: //BAFS/escalated tickets/59000/59210/apr04_db 4/5/2002 7:39:48 AM hbui The unrared database is missing several files (such as the files for tables nh_schema_version, nh_element...). We may need the customer to re-send the database in tar format instead. I did try to load an in-house ascii database and it went through. 4/5/2002 10:54:09 AM dbrooks changed to more info per escalation meeting 4/5. 4/5/2002 3:26:16 PM foconnor Requested database again and query for device_speed2 entries. 4/10/2002 9:22:41 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Wednesday, April 10, 2002 9:11 AM To: Bui, Ha Cc: O'Connor, Farrell; Trei, Robin Subject: Problem ticket 20733 Importance: High Ha, The database save is on both the ftp site: ftp.concord.com/incoming/CC59210-2002-04-08-AutoBackUp.tdb.tar and it was also downloaded into //BAFS/59000/59210/Apr10_db/CC59210-2002-04-08-AutoBackUp.tdb.tar. 4/11/2002 11:08:13 AM hbui The nh_element table has an invalid or undefined device_speed2, which caused the failure. I changed that value to -1. Farrell will send this file to the customer to try again. Meanwhile, engineers shall figure out why this bogus data got into the database. (In the poller config file, there's no device_speed2 for the element.) 4/12/2002 8:22:17 AM foconnor Ha, Is it possible to get a script or a procedure the customer can run during the ascii save and then the load process? That way the information will be current. 4/18/2002 10:48:46 AM dbrooks change to more info per escalated ticket meeting. 4/18/2002 11:16:10 AM foconnor We are waiting on the customer/reseller for verification. Called Reseller today and he needs to check with customer Fortis Bank. 4/29/2002 1:43:09 PM mmcnally The ascii save finished successfully. 1/30/2002 3:35:29 PM mmcnally Conversation rollups are kicked off by the scheduler. 
It seems to be hanging on the iimerge process. It seems the iimerge process hangs after it has rolled up the data. The database has remained static. When they run the nhiDiallogRollup -Dall command, they receive the following error message: INTERNAL: Couldn't open message file '' INTERNAL: Couldn't open message file '/usr/netHealth/sys/messageText.sys' (). ERlookup: Error accessing message text: For unknown reasons. Check messages files in the installation. Internal error. Report this problem to your technical representative. Files were placed on BAFS/58000/58723. 2/1/2002 1:46:08 PM yzhang Mike, Looks like this customer needs to do the following: 1) Resize ingresTransactionLog to 2G if the logsize is not 2G 2) Unlimit the stack size; here is how. If the customer is using csh, the command is: unlimit stacksize To check what the limits are: limit If the customer is using ksh or sh, the command is ulimit -s 2097148 ulimit -a 2/1/2002 2:03:07 PM mmcnally Requested more info. 2/4/2002 11:21:35 AM mmcnally -----Original Message----- From: David J Morreale [mailto:David.J.Morreale@jpl.nasa.gov] Sent: Monday, February 04, 2002 11:08 AM To: Mike McNally Cc: Kelly Feagans; Donald Gallop Subject: Ticket 58723 - ingres_log Hi Mike, Please close this ticket. We will open another ticket if this occurs on the new machine. Thanks Dave M Call ticket was closed per customer email above. 2/4/2002 1:58:29 PM mmcnally Customer receives the following error in the ingres error.log when starting the database. Thu Jan 3 20:18:24 2002 E_CL25FF_CS_FATAL_ERROR The server has encountered a FATAL e$ ::[51974, ]: Thu Jan 3 20:18:24 2002 Bus Error (SIGBUS) They claim to be only monitoring 300 elements. 2/14/2002 4:01:20 PM yzhang Here is the detailed description. 
Thu Jan 3 20:18:24 2002 E_CL25FF_CS_FATAL_ERROR The server has encountered a FATAL e$ ::[51974, ]: Thu Jan 3 20:18:24 2002 Bus Error (SIGBUS) Yes, this is an ingres error, but I want to know if the database, once started, is running OK. Is there a core file produced? And what actual database problem are they experiencing? 2/19/2002 10:42:29 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Tuesday, February 19, 2002 10:32 AM To: Zhang, Yulun Subject: RE: 20829/58455 Yulun, The database is running, no core files. This happens when they run nhStartDb and nhStopDb. The problem is the error message. They want to know what is causing it. Thanks, Mike 2/25/2002 5:43:27 PM yzhang Tell the customer that the Bus error originates from an ingres memory allocation function call. I don't know which call it is. If they can send us the debug output for nhStartDb and nhStopDb, and also the output of grepping for ingres processes after each nhStartDb and nhStopDb, we might be able to find more detail. Yulun 2/25/2002 7:02:43 PM rrick Yulun, - This can happen on HP, Solaris, and WinNt. - It seems to happen mostly on HP, though. - Seems to me that this could be either a memory allocation issue or a stack issue. - It has shown up with other db issues. - It has also shown up when data analysis & data maintenance reports run. - It also has shown up when Conversations Rollups have hung ... may not be an associated problem. 2/26/2002 12:13:42 PM yzhang Let's work on this issue just for this customer. I still need the following: debug output for nhStartDb and nhStopDb, and also the output of grepping for ingres processes after each nhStartDb and nhStopDb 2/26/2002 2:18:38 PM mmcnally requested debug from customer. 3/1/2002 7:40:42 AM mmcnally -----Original Message----- From: djgore@mmm.com [mailto:djgore@mmm.com] Sent: Thursday, February 28, 2002 4:40 PM To: McNally, Mike Cc: pratajczyk@concord.com Subject: Re: 58455 "Receiving Fatal BUS error in the ingres error log." 
Hi, I don't have much time to fiddle with this now. Please remember this shows up on a clean install of ingres before loading the nethealth database back in. It doesn't show in the console but in the ingres/files/errlog.log file. Don Gore 3M IT Global Network Services 224-4N-27 Customer has no time to work this; setting to low priority. 4/2/2002 5:20:20 PM yzhang Closed; will reopen when the customer wants to work on this. 4/3/2002 2:07:28 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, April 03, 2002 1:56 PM To: 'djgore@mmm.com' Subject: Ticket # 58455 - Ingres Errors Don, I have been re-assigned this ticket as Mike Mcnally is on vacation. I understand you have little bandwidth to look at the issue in depth. We will open a new ticket referencing this one, at your request. Sincerely, 2/4/2002 6:11:07 PM rkeville nhReset fails to bring processes down cleanly due to an Ingres error. - Customer schedules multiple nhResets a week; on Feb 3, 2002, the maint job kicked off and failed to restart the servers, and the customer lost 12 hours of data. - Checked the maint log for details: ----- Job started by Scheduler at '02/03/2002 19:30:16'. ----- ----- $NH_HOME/bin/nhReset ----- Stopping Network Health servers. 19:30:24 Waiting for Network Health processes to exit 19:30:30 Waiting for Network Health processes to exit 19:30:39 Waiting for Network Health processes to exit 19:30:51 Waiting for Network Health processes to exit Network Health processes are not running nhServer requires an existing, accessible database. However, INGRES returned the following error when an attempt was made to access 'nethealth': E_US0049 Invalid internal data prevents database access. (Sun Feb 3 19:31:13 2002) Please specify an existing database: nethealth Make sure that the database storage is mounted and accessible. - There is nothing in the errlog.log about this. - Collect cust data and collect db data are in the escalated tickets dir. 
Note: I will be on-site from Tues February 5 - Thurs February 7. ######################################################### 2/5/2002 10:55:20 AM yzhang Jim, I think Bob is with you today, right? Can you send me the following: 1) nhReset script (please make sure to run nhReset after doing the source) 2) sh -x nhReset >& nhReset.out 3) login as ingres 4) ls -l $NH_HOME/idb/ingres/data/default/nethealth > physical_file.out Thanks Yulun 2/5/2002 8:27:45 PM yzhang I talked with Bob (he is on site) about this problem. MCI lost about 12 hours of data because the nhServer did not come up from running nhReset on Feb. 3; below is the output of nhReset, and the related portion of errlog is also shown below. Here is how the error message on nhReset came out (customer did not set NH_RESET_INGRES): nhReset calls nhServer start; nhServer start, in turn, calls nhiCheckIngres.sh to make sure ingres is OK before starting the server. But nhiCheckIngres.sh failed on ingresExists(), because the database was not accessible at that moment. There is no error shown in errlog for Feb. 3, but there is an exclusive deadlock on Jan. 15. Bob, can you collect everything from $NH_HOME/tmp (there should be an ingres error file there)? The other possible cause is that our script does not check the $? value correctly. We need some discussion on this tomorrow. 2/12/2002 3:16:27 PM rtrei I looked at this ticket and its sister ticket (59992/12102). The errlog.logs for both of these look fine. What struck me the most was that both of these systems had problems only on Sunday at 7:30. Their maint jobs scheduled for Wed & Fri ran successfully. In this ticket's case, the problem was that ingres was unavailable. In the other ticket's case, something kept the job from running for more than an hour after the scheduler started it, indicating a heavy load of some kind. Yulun, Bob, and I agreed on the following next steps: 1. Bob to get everything he can on cron jobs, operator schedules, etc. for Sunday evening. 2. 
Yulun to talk with Bonneau and/or Brown to get access to a test machine where we can set up a 5.0 machine and do nhReset every half hour. 3. Bob to see if MCI will set up a test machine that duplicates their environment (first choice) or recreates it upstairs (2nd choice). I am putting this into more info until we hear from Bob, although Yulun will be discussing setting up the test environment. 2/12/2002 3:16:37 PM rtrei . 2/19/2002 2:37:11 PM rkeville The maint job ran Friday night without a problem. ################################################ 2/19/2002 3:30:33 PM yzhang Bob, Did you hear anything from MCI about last Sunday's maintenance job? And did you find anything from their environment (such as what processes were running when the system ran nhReset, and what system maintenance cron jobs are scheduled, as Robin suggested)? Thanks Yulun 2/22/2002 2:11:49 PM yzhang Hello, I suggest de-escalating these two tickets if we don't have the information which can help us further research the problem. We can continue on the problem when new information is available. Yulun 2/25/2002 12:52:31 PM apier De-escalated per 2/25/2002 bug meeting. 3/22/2002 4:30:54 PM yzhang Bob, I think we have two remedies with MCI regarding the maintenance job; one of them is 20841. Do you know if they still experience the same problem? I am wondering what we can do to help them. Do they have any requests, or do we simply close the ticket? Yulun 4/1/2002 9:41:19 AM yzhang Customer does not have the problem now; the problem ticket is closed, but the call ticket is still open in case the problem comes back. 
2/6/2002 11:06:08 AM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, February 06, 2002 10:54 AM To: Zhang, Yulun; Burke, Walter Subject: RE: nhFetch # 58936 3) missing data on central for: ivunix - missing data on central: from 1/25/02 - 1/29/02 1011963599 - 1012309199 vusno251 - missing data on central: 1/7/02 - 1/11/02 1010408399 - 1010753999 1/12/02 - 1/15/02 1010840399 - 1011099599 1/18/02 - 1/23/02 1011358799 - 1011790799 dipnhpc1 - missing data on central: 1/19/02 - 1/29/02 1011445199 - 1012309199 _____________________________________ We need to migrate the data from the remote into the central. The major reason it is not there is due to multiple problems with nhFetch. 2/6/2002 11:45:49 AM wburke -----Original Message----- From: Zhang, Yulun Sent: Wednesday, February 06, 2002 11:32 AM To: Burke, Walter Subject: RE: nhFetch # 58936 find out exactly which stats1 tables need to be transferred from each remote to central, then we can write a script to unload them from the remote, then reload to central 2/6/2002 11:59:57 AM yzhang find out exactly which stats1 tables need to be transferred from each remote to central, then we can write a script to unload them from the remote, then reload to central 2/6/2002 3:21:41 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, February 06, 2002 3:00 PM To: 'Vinesh.Latchman@team.telstra.com' Cc: Zhang, Yulun Subject: Ticket # 58741 - Missing data on Central found on remotes. Vinesh, Missing data on Central found on remotes. - This derives most likely from the numerous nhFetch problems. - Now that the fetch is working, we should be able to merge the data from the remotes into the central. Action: Turn off Statistics Rollups for the time being. Yulun should be able to write a script which will copy these tables out of the remote, which, once transferred to the Central, should be merged. As this is the highest priority, we will address this first. 
Based on the output given, we have determined the following tables must be merged into the central from the following Remotes: (note: table names are UTC timestamped.) ivunix - missing data on central: from 1/25/02 - 1/29/02 1011963599 - 1012309199 nh_stats1_1011963599 nhealth table nh_stats1_1012049999 nhealth table nh_stats1_1012136399 nhealth table nh_stats1_1012222799 nhealth table nh_stats1_1012309199 nhealth table 5 tables total vusno251 - missing data on central: 1/7/02 - 1/11/02 1010408399 - 1010753999 nh_stats1_1010408399 nhealth table nh_stats1_1010494799 nhealth table nh_stats1_1010581199 nhealth table nh_stats1_1010667599 nhealth table nh_stats1_1010753999 nhealth table 1/12/02 - 1/15/02 1010840399 - 1011099599 nh_stats1_1010840399 nhealth table nh_stats1_1010926799 nhealth table nh_stats1_1011013199 nhealth table nh_stats1_1011099599 nhealth table 1/18/02 - 1/23/02 1011358799 - 1011790799 nh_stats1_1011358799 nhealth table nh_stats1_1011445199 nhealth table nh_stats1_1011531599 nhealth table nh_stats1_1011617999 nhealth table nh_stats1_1011704399 nhealth table nh_stats1_1011790799 nhealth table 15 tables total dipnhpc1 - missing data on central: 1/19/02 - 1/29/02 1011445199 - 1012309199 nh_stats1_1011445199 nhealth table nh_stats1_1011531599 nhealth table nh_stats1_1011617999 nhealth table nh_stats1_1011704399 nhealth table nh_stats1_1011790799 nhealth table nh_stats1_1011877199 nhealth table nh_stats1_1011963599 nhealth table nh_stats1_1012049999 nhealth table nh_stats1_1012136399 nhealth table nh_stats1_1012222799 nhealth table nh_stats1_1012309199 nhealth table 11 tables total 2/7/2002 5:21:26 PM yzhang Walter, here are the steps (6 steps) to unload from ivunix and load into central for one stats1 table. It would be great if somebody can work with you to write a script 1) sql for unloading a missing ivunix stats1 table: copy table nh_stats1_1011963599() into 'nh_stats1_1011963599.dat'\g 2) tar the five dat files 3) ftp to central 4) untar in the 
central 5) create the stats1 table on the central if it does not exist, then load this table with data from the remote, and create indexes for this table: create table nh_stats1_1011963599 as select * from (one of the stats1 tables on the central) where 1=2\g copy table nh_stats1_1011963599() from 'nh_stats1_1011963599.dat'\g create unique index nh_stats1_1011963599_ix1 on nh_stats1_1011963599(sample_time, element_id) with structure = btree, nocompression, key = (sample_time, element_id), nonleaffill = 100, leaffill = 100, fillfactor = 100, location = (ii_database)\g create unique index nh_stats1_1011963599_ix2 on nh_stats1_1011963599(element_id, sample_time) with structure = btree, nocompression, key = (element_id, sample_time), nonleaffill = 100, leaffill = 100, fillfactor = 100, location = (ii_database)\g 6) update nh_rlp_boundary table 2/7/2002 5:43:50 PM yzhang If the table already exists, the only thing you need to do is load the data into the existing table, without updating the rlp_boundary table. Otherwise, do the following to update the nh_rlp_boundary table for nh_stats1_1011963599: insert into nh_rlp_boundary values('ST',1,1011963599 - 86400,1011963599,1011963599 - 86400,1011963599,' ',0) Can I see the script and the operation steps before you send them to the customer? Thanks Yulun 2/7/2002 7:51:41 PM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, February 07, 2002 7:35 PM To: 'Vinesh.Latchman@team.telstra.com' Subject: FW: Ticket # 58741 - Missing data on Central found on remotes. 
# # Create Table echo "create table nh_stats1_1011963599 as select * from (one of the stats1 tables on the central) where 1=2\g" | sql $NH_RDBMS_NAME >> copyIn.log # # Copy Table In echo "copy table ${table_name} () from '${table_name}.dat'\g" | sql $NH_RDBMS_NAME >> copyIn.log # # Create Index 1 echo "create unique index ${table_name}_ix1 on ${table_name}(sample_time,element_id) with structure = btree,nocompression, key = (sample_time, element_id),nonleaffill = 100,leaffill = 100, fillfactor = 100,location = (ii_database)\g" | sql $NH_RDBMS_NAME >> copyIn.log # # Create Index2 echo "create unique index ${table_name}_ix2 on ${table_name}(element_id,sample_time) with structure = btree,nocompression, key = (element_id, sample_time),nonleaffill = 100,leaffill = 100, fillfactor = 100,location = (ii_database)\g" | sql $NH_RDBMS_NAME >> copyIn.log # # Update rlp_boundary echo "insert into nh_rlp_boundary values('ST',1,$number - 86400,$number,$number - 86400,$number,' ',0)\g" | sql $NH_RDBMS_NAME >> copyIn.log 2/11/2002 7:04:58 PM wburke Customer is still missing data. - Need yulun to discuss. 2/12/2002 12:21:06 PM wburke Copy in does not work. need to escalate. 2/12/2002 12:50:01 PM yzhang Walter, did you ask Vinesh to run the exact commands as shown? this is not going to work. You mentioned the copy in failed; failed on what query, and what is the error? Don't escalate this yet 2/12/2002 3:46:51 PM yzhang get the following from each remote and central as soon as possible: echo "select table_name, num_rows from iitables where table_name like '%nh_stats%' and table_name not like '%ix%' order by table_name\g" | sql $NH_RDBMS_NAME > remote_hostname.out 2/12/2002 10:50:33 PM yzhang worked with the customer tonight trying to load all of the missing data; now all data on the ivunix remote has been loaded to central. Customer will use a similar method to load some missing data from the other remotes. 
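The unload half of Yulun's six-step procedure (steps 1-4, run on the remote) can be sketched as below. This is a hedged illustration, not the script Yulun reviewed: it prints the Ingres statements and transfer commands for the ivunix tables from this ticket rather than executing them, so the output can be inspected first.

```shell
#!/bin/sh
# Generate the Ingres unload statements for a list of stats1 tables
# (step 1), then note the tar/ftp/untar transfer steps (steps 2-4).
# Nothing here talks to the database; it is a dry run.

tables="nh_stats1_1011963599 nh_stats1_1012049999 nh_stats1_1012136399 nh_stats1_1012222799 nh_stats1_1012309199"

# step 1: one "copy ... into" per table, to be piped to sql on the remote
for t in $tables; do
    echo "copy table ${t}() into '${t}.dat'\\g"
done

# steps 2-4: bundle the .dat files, transfer in binary mode, unpack
echo "tar cf stats1.tar nh_stats1_*.dat"
echo "# ftp stats1.tar to the central in binary mode, then: tar xf stats1.tar"
```

Each emitted `copy table ... into` line would be run as `echo "..." | sql $NH_RDBMS_NAME` on the remote, matching the one-liner style used throughout this ticket.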
Then he can run reports to confirm existence of the data 2/13/2002 8:29:18 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Wednesday, February 13, 2002 1:02 AM To: ''Zhang, Yulun ' '; ''Burke, Walter ' ' Subject: RE: Telstra which NEED to be addressed TODAY. Here are the attached files -----Original Message----- From: Latchman, Vinesh To: 'Zhang, Yulun '; Latchman, Vinesh Cc: 'Burke, Walter ' Sent: 2/13/02 4:55 PM Subject: RE: Telstra which NEED to be addressed TODAY. Yulun, Walter, We are not making much headway into the missing data issue. The attached files show that the 5 stats tables for remote ivunix may be in the database. However, the attached pdf report shows that the loaded data is not being reported. I did the following: 1) copied the 5 stats tables at the remote ivunix 2) ftp'd the stats files from the remote into central 3) created 5 tables 4) copied in the 5 stats tables from ivunix 5) created index 1 & 2 6) updated the rlp boundary 7) at the central, did select count(*) on each of the 5 tables added from the remote; refer to attached files for detail on commands & output 8) ran a report for an element from remote ivunix between 25th January and 29th January; refer to attached pdf file 9) ran the same report at remote ivunix and it shows data on the report. I am waiting to see that missing data from remote ivunix is being reported before I repeat the above procedure with 15 tables from vusno251 and 11 tables from dipnhpc1. What's the plan to move forward? 2/13/2002 11:18:59 AM yzhang Vinesh, Can you get the following from both the ivunix and central sites: from ivunix: echo "select * from nh_rlp_boundary where rlp_type ='ST'\g" | sql $NH_RDBMS_NAME > rlp_ivunix.out and from the central: echo "select * from nh_rlp_boundary where rlp_type ='ST'\g" | sql $NH_RDBMS_NAME > rlp_central.out I want to make sure the boundary table on the central did update properly. 
Thanks Yulun 2/14/2002 10:34:23 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, February 14, 2002 10:24 AM To: Zhang, Yulun Subject: Ticket # 58741 - PT# 20869 Obtained. ________________ 2/14/2002 11:24:55 AM yzhang The problem of the missing elements is that the nh_rlp_boundary entries from ivunix have not been merged into central. You need to run the following on the central machine: 1) sql $NH_RDBMS_NAME 2) insert into nh_rlp_boundary values ('ST', 1,1011877200,1011963599,1011877254,1011963599,' ',0)\g commit\g insert into nh_rlp_boundary values ('ST', 1, 1011963600,1012049999,1011963600,1012049999,' ',0)\g commit\g insert into nh_rlp_boundary values ('ST', 1, 1012050000,1012136399,1012050000,1012136399,' ',0)\g commit\g insert into nh_rlp_boundary values ('ST', 1,1012136400,1012222799,1012136400,1012222799,' ',0)\g commit\g insert into nh_rlp_boundary values ('ST', 1,1012222800,1012309199,1012222890,1012309199,' ',0)\g commit\g There should be no error for each insert; let us know if you see an error. The query has been tested 2/14/2002 6:36:27 PM yzhang Just received a call from the customer, saying that after running the script I sent this morning, he can see the missing data on the central. Now there is still some data missing from the other remotes, but I will write him instructions for taking care of it. He said he can do it. This problem definitely can be de-escalated Thanks Yulun 2/14/2002 7:03:36 PM yzhang here are the instructions for handling the other remotes; please handle each remote separately. example: you transfer the nh_stats1_1000344444 table from remote dip (remote system name) to central. 1) you finish the transfer 2) on the remote: sql $NH_RDBMS_NAME select * from nh_rlp_boundary where max_range = 1000344444 and rlp_type= 'ST' and rlp_stage_nmbr = 1 3) on the central, insert into nh_rlp_boundary values (output of above select); you need to format it so it looks like the insert statement I sent to you. Walter, can you help him with this. 
Thanks 2/15/2002 11:31:58 AM yzhang here are the instructions for handling the other remotes; please handle each remote separately. example: you transfer the nh_stats1_1000344444 table from remote dip (remote system name) to central. 1) you finish the transfer 2) on the remote: sql $NH_RDBMS_NAME select * from nh_rlp_boundary where max_range = 1000344444 and rlp_type= 'ST' and rlp_stage_nmbr = 1 3) on the central, insert into nh_rlp_boundary values (output of above select); you need to format it so it looks like the insert statement I sent to you. Walter, can you help him with this. Thanks 2/19/2002 11:54:27 AM yzhang Walter, we did not get the ascii data for the trend report from Vinesh. Can you write him the detailed steps for running an ascii report, and practice the procedure before you send it out. Most likely their data is out of order. Thanks Yulun 2/19/2002 10:11:29 PM yzhang run the following from remote ivunix: echo "select table_name, num_rows from iitables where table_name like '%nh_stats%' and table_name not like '%ix%' order by table_name\g" | sql $NH_RDBMS_NAME > ivunix_stats.out and from the central machine: echo "select table_name, num_rows from iitables where table_name like '%nh_stats%' and table_name not like '%ix%' order by table_name\g" | sql $NH_RDBMS_NAME > central_stats.out 2/21/2002 4:37:50 PM wburke loaded central for Yulun, on nevada.concord.com 2/21/2002 4:38:06 PM wburke loaded central for Yulun, on nevada.concord.com 3/4/2002 2:37:41 PM yzhang this is about the extra line showing on the report running on the central site. You have his central db loaded; can you reproduce the report? Then let me know and I will bring a report person to look at it. Thanks Yulun 3/4/2002 7:11:18 PM wburke -----Original Message----- From: Burke, Walter Sent: Monday, March 04, 2002 7:01 PM To: Zhang, Yulun; Chapman, Sheldon Subject: RE: 20869/58709 I don't have the central db after the merge. Do we need it or can we just look at the stats tables? 
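Step 3 of the remote-to-central boundary transfer that Yulun describes (format the selected nh_rlp_boundary row into an insert statement) can be sketched as a small helper. This is a hypothetical script, not a Concord tool; the column order is assumed from the insert statements posted earlier in this ticket.

```shell
#!/bin/sh
# Format one nh_rlp_boundary row (as selected on the remote) into the
# matching insert statement to run on the central via sql.

make_boundary_insert() {
    # $1..$6: rlp_type, rlp_stage_nmbr, min_range, max_range, min_time, max_time
    printf "insert into nh_rlp_boundary values('%s',%s,%s,%s,%s,%s,' ',0)\\\\g\n" \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# example, using a boundary row from this ticket:
make_boundary_insert ST 1 1011877200 1011963599 1011877254 1011963599
```

The printed statement would then be piped to `sql $NH_RDBMS_NAME` on the central, followed by a `commit\g`, exactly as in the inserts Yulun posted.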
3/5/2002 12:57:03 PM yzhang I think we might be able to reproduce the problem by running the same reports with the database you loaded. Just try to run the same report and see what you can get; then we will reassign this to a report person. Yulun 3/6/2002 9:27:26 AM yzhang I think Walter needs the following: 1) echo "select * from nh_element_core where name = 'arm1.Armidale-RH' \g " | sql ehealth > elem.out 2) echo "select table_name, num_rows from iitables where table_name like '%nh_stats%' and table_name not like '%ix%' order by table_name\g" | sql ehealth > stats_table.out 3/7/2002 12:00:36 PM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, March 07, 2002 11:50 AM To: 'Latchman, Vinesh'; Zhang, Yulun Subject: RE: Ticket # 58741 - Missing Data Vinesh, We will need the following tables: echo " copy table nh_stats1_1014643799 () into '$NH_HOME/tmp/nh_stats1_1014643799.dat'\g " | sql nethealth echo " copy table nh_stats1_1014730199 () into '$NH_HOME/tmp/nh_stats1_1014730199.dat'\g " | sql nethealth Thanks, Walter 4/4/2002 4:49:44 PM wburke Closed this; it has been over 6 weeks. The data is gone. 4/4/2002 4:53:14 PM yzhang closed 2/6/2002 11:19:08 AM wburke The files attached are the tables I copied from the database. BAFS/58379/2.6.02 Analysis: There are multiple entries of groups and group lists named "all" which cause the console to crash when our customer selects one of them. When he selects another group, the console does not crash and he can generate reports without any problem. I'm unsure if I can delete all the groups named "all", since I do not want to lose any data. Please have a close look at the files and tell us which rows in which tables we may delete in the sql interface without causing any data loss. 2/12/2002 11:49:27 AM wburke From Install log ============= There are currently only 3237655 KBytes of disk space available in '/opt/nethealth/idb'; at least 37175720 KBytes are required. Do you wish to use /dev/md/dsk/d11 anyway? 
'no' will exit install? [y] AND Converting database nethealth Fatal database error: REV5_0 step 50 in rev 9 16-Jan-2002 11:29:50 - Database error: -30210, E_US1194 Duplicate key on INSERT detected. (Wed Jan 16 05:29:50 2002) 2/12/2002 12:30:59 PM wburke This may be the same as 19887 2/12/2002 3:23:43 PM schapman Cust Sensitivity upgrade to 5.0.2 is failing due to this problem 2 additional ICS customers are experiencing the same problem. 2/12/2002 4:15:39 PM yzhang The file you posted for Feb. 6 is unreadable; can you collect the same file again? Also, where is the upgrade log 2/13/2002 10:47:40 AM wburke Tuesday, February 12, 2002 12:30:59 PM wburke This may be the same as 19887 Sent one-off awaiting reply. 2/13/2002 11:19:12 AM rtrei Walter-- The short answer is that the 'All' group should not have any elements directly associated with it. An 'All' group is supposed to contain all the elements of the specified group type currently on the system. It is generated at the time it is referenced, and it should never be edited. If the customer is seeing this in the console, then he must have created an 'all' group, and I believe this can only be done by creating the 'all.grp' file. (But I'm not positive.) There really shouldn't be an 'All' group and an 'all' group-- it will not work well, as the customer is finding out. So, the best thing to do is to delete the 'all' group and replace its usage in any scheduled jobs with the 'All' group. I can write a script to do that, but before we proceed, I do want to look at the customer's data to be sure I am understanding the problem correctly. Can you get the following for me. 
echo "copy table nh_group () into 'nh_group.dat'\g" | sql nethealth echo "copy table nh_group_list () into 'nh_group_list.dat'\g" | sql nethealth echo "copy table nh_group_list_members () into 'nh_group_list_members.dat'\g" | sql nethealth echo "copy table nh_group_members () into 'nh_group_members.dat'\g" | sql nethealth echo "copy table nh_subject () into 'nh_subject.dat'\g" | sql nethealth these are binary files so they must be ftp'd in binary. Make sure the customer sources his nethealth, and the usual stuff, before running the commands. thanks. 2/13/2002 1:12:29 PM rtrei Walter-- After looking over the information in the escalated tickets directory I believe that the problem is definitely the 'all' group. The customer should run the following script: echo "delete from nh_subject where subject_id in (1000069, 1000070, 1000071, 1000072); commit\g" | sql nethealth Please give me another text dump of the nh_subject table so that we can confirm those rows are gone. Then the console should not crash. However, using an 'all' group is not supported. I recommend that the customer try to rename this or get rid of it at his earliest convenience. Let me know if this doesn't solve the problem. I am going to need to write up something to update the knowledge base for this error. ----- 2/13/2002 1:26:34 PM wburke requested info. echo "delete from nh_subject where subject_id in (1000069, 1000070, 1000071, 1000072); commit\g" | sql nethealth then echo "select * from nh_subject order by subject_id\g " | sql nethealth > nhSubj.out Send .out file. 
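Robin's fix and its verification can be combined into a dry-run sketch. This is illustrative only: it prints the two statements instead of piping them to sql, and the subject_ids are the ones from this ticket, which will differ on another system.

```shell
#!/bin/sh
# Print the delete for the user-created 'all' group rows, then the
# verification query used to confirm the rows are gone. Dry run only.

ids="1000069, 1000070, 1000071, 1000072"

printf "delete from nh_subject where subject_id in (%s); commit\\\\g\n" "$ids"
printf "select * from nh_subject order by subject_id\\\\g\n"
# To execute for real: pipe each printed statement to `sql nethealth`
# (after sourcing nethealth), as in the one-liners above.
```

Printing first lets the statements be reviewed before anything touches nh_subject, which matters here since a wrong id list would delete legitimate group rows.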
2/19/2002 10:06:56 AM wburke ####################################################################### # # # I C S - N e t w o r k H e a l t h - T i c k e t # # ========================== # ####################################################################### Dear Concord Support, for our Customer DG BANK, Frankfurt we CLOSED the following Trouble Report: =============================================================== Upgrade failed =============================================================== ICS REQUEST ID: ICSREQ003759 YOUR TICKET / LOG ID: 58739, (duplicate 58776) CONTACT-DATA: ================= SUBMITTER: CSC CUSTOMER NAME: DG BANK TOWN: Frankfurt PLATFORM DATA: ============== HOST NAME: dfwnh1 HOST ID: 80c7adda IP ADDRESS: 10.6.1.15 VENDOR: Sun OPERATING SYSTEM: Solaris 2.8 HARDWARE MODEL: Ultra 60 MEMORY: 1024 MB LICENSE DATA: ============= ICS CONTRACT ID: 000700 CUSTOMER/CONTRACT ID: 000805 VERSION: 4.8 INSTALLED PATCHES: D03, P03 ADDITIONAL INFORMATION: Number of Elements: 1000 ICS REQUEST DATA: ================= TYPE: Problem PRIORITY: 2 STATUS: Closed LONG DESCRIPTION: - A Nethealth upgrade from 4.8 to 5.0.2 failed with the error message given below: - free disk space: 3.4 GB Checking saved writable files. --------------------------------------- OpenIngres II 2.0 Reconfiguration Shutting down OpenIngres II 2.0 (/opt/nethealth/idb) for reconfiguration INGRES reconfiguration complete. Starting INGRES. --------------------------------------- Install eHealth database Started polling at Wed Jan 16 11:28:11 MET 2002...Polled data temporarily written to /opt/nethealth/tmp Waiting for poller initialization to complete .. Poller initialization completed Converting database nethealth Fatal database error: REV5_0 step 50 in rev 9 16-Jan-2002 11:29:50 - Database error: -30210, E_US1194 Duplicate key on INSERT detected. (Wed Jan 16 05:29:50 2002) 16-Jan-2002 11:29:50 - Database error: -30210, E_US1194 Duplicate key on INSERT detected. 
(Wed Jan 16 05:29:50 2002) The database nethealth has not been converted. You will not be able to run eHealth with this database. You can continue the installation, skipping the remaining database steps. You won't be able to run eHealth until the database problem is resolved. When the database problem is resolved, you can restart the installation, and it will handle the remaining database steps. So you can: 1) stop now. 2) continue, but skip the database steps. 3) specify another database What is your choice? (1|2|3) [3] SOLUTION: Concord script to remove duplicates in group-tables worked __________________________________________________________ Closed. Creating KnowledgeBase Solution. 2/25/2002 11:37:43 AM apier De-escalated per bug meeting on 2/25/02 Work to be scheduled for P03 per Peggy. 3/22/2002 5:12:11 PM hbui Put the fix in patch 3 2/6/2002 3:30:05 PM foconnor Customer gets duplicate keys on scheduled fetches. Job started by Scheduler at '30/1/2002 07:26:42'. ----- ----- $NH_HOME/bin/nhFetchDb ----- ### Beginning Fetch Wed Jan 30 07:26:47 EST 2002 ENTRY> netmgrn1.muc nethealth / nethealth Connecting to host netmgrn1.muc Host netmgrn1.muc is alive FTP connection successful to host netmgrn1.muc Copying files from host netmgrn1.muc netmgrn1.muc:://Remote.tdb.01-30-2002_05.02.30 tar: blocksize = 20 netmgrn1.muc:://Remote.tdb.01-30-2002_07.02.30 tar: blocksize = 20 Done copying files from netmgrn1.muc Disconnecting from host netmgrn1.muc Disconnected from host netmgrn1.muc ENTRY> netmgrh1.w1 nethealth / nethealth Connecting to host netmgrh1.w1 Host netmgrh1.w1 is alive FTP connection successful to host netmgrh1.w1 Copying files from host netmgrh1.w1 netmgrh1.w1:://Remote.tdb.01-30-2002_05.02.24 tar: blocksize = 20 netmgrh1.w1:://Remote.tdb.01-30-2002_07.02.24 tar: blocksize = 20 Done copying files from netmgrh1.w1 Disconnecting from host netmgrh1.w1 Disconnected from host netmgrh1.w1 ENTRY> netmgrh2.w2 nethealth / nethealth Connecting to host netmgrh2.w2 
Host netmgrh2.w2 is alive FTP connection successful to host netmgrh2.w2 Copying files from host netmgrh2.w2 netmgrh2.w2:://Remote.tdb.01-30-2002_05.02.03 tar: blocksize = 20 netmgrh2.w2:://Remote.tdb.01-30-2002_07.02.02 tar: blocksize = 20 Done copying files from netmgrh2.w2 Disconnecting from host netmgrh2.w2 Disconnected from host netmgrh2.w2 ENTRY> netmgrh3.w3 nethealth / nethealth Connecting to host netmgrh3.w3 Host netmgrh3.w3 is alive FTP connection successful to host netmgrh3.w3 Copying files from host netmgrh3.w3 netmgrh3.w3:://Remote.tdb.01-30-2002_05.02.10 tar: blocksize = 20 netmgrh3.w3:://Remote.tdb.01-30-2002_07.02.09 tar: blocksize = 20 Done copying files from netmgrh3.w3 Disconnecting from host netmgrh3.w3 Disconnected from host netmgrh3.w3 ENTRY> netmgrh4.w4 nethealth / nethealth Connecting to host netmgrh4.w4 Host netmgrh4.w4 is alive FTP connection successful to host netmgrh4.w4 Copying files from host netmgrh4.w4 netmgrh4.w4:://Remote.tdb.01-30-2002_05.02.57 tar: blocksize = 20 netmgrh4.w4:://Remote.tdb.01-30-2002_07.02.56 tar: blocksize = 20 Done copying files from netmgrh4.w4 Disconnecting from host netmgrh4.w4 Disconnected from host netmgrh4.w4 ENTRY> netmgrh6.w6 nethealth / nethealth Connecting to host netmgrh6.w6 Host netmgrh6.w6 is alive FTP connection successful to host netmgrh6.w6 Copying files from host netmgrh6.w6 netmgrh6.w6:://Remote.tdb.01-30-2002_05.02.59 tar: blocksize = 20 netmgrh6.w6:://Remote.tdb.01-30-2002_07.03.00 tar: blocksize = 20 Done copying files from netmgrh6.w6 Disconnecting from host netmgrh6.w6 Disconnected from host netmgrh6.w6 ENTRY> netmgrh12.muc nethealth / nethealth Connecting to host netmgrh12.muc Host netmgrh12.muc is alive FTP connection successful to host netmgrh12.muc Copying files from host netmgrh12.muc netmgrh12.muc:://Remote.tdb.01-30-2002_05.02.14 tar: blocksize = 20 netmgrh12.muc:://Remote.tdb.01-30-2002_07.02.14 tar: blocksize = 20 Done copying files from netmgrh12.muc Disconnecting from host 
netmgrh12.muc Disconnected from host netmgrh12.muc ENTRY> netmgrh34.muc nethealth / nethealth Connecting to host netmgrh34.muc Host netmgrh34.muc is alive FTP connection successful to host netmgrh34.muc Copying files from host netmgrh34.muc netmgrh34.muc:://Remote.tdb.01-30-2002_05.02.07 tar: blocksize = 20 netmgrh34.muc:://Remote.tdb.01-30-2002_07.02.07 tar: blocksize = 20 Done copying files from netmgrh34.muc Disconnecting from host netmgrh34.muc Disconnected from host netmgrh34.muc ### Beginning Merge Wed Jan 30 07:29:39 EST 2002 Deleting the following element ids from the central database: INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (0 rows) From 100000002 to 100627085. Removing element and analyzed data after 1012363248 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (6407 rows) (6407 rows) (15 rows) (1804 rows) (5939 rows) Deleting the following element ids from the central database: From 11000001 to 11101792. Removing element and analyzed data after 1012363060 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1247 rows) (1247 rows) (4 rows) (746 rows) (1247 rows) Deleting the following element ids from the central database: From 12000084 to 12005375. Removing element and analyzed data after 1012363021 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (2093 rows) (2093 rows) (0 rows) (949 rows) (1287 rows) Deleting the following element ids from the central database: From 13000001 to 13003494. Removing element and analyzed data after 1012362853 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (294 rows) (294 rows) (0 rows) (92 rows) (203 rows) Deleting the following element ids from the central database: From 14000001 to 14001094. Removing element and analyzed data after 1012363139 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. 
(528 rows) (528 rows) (0 rows) (120 rows) (366 rows) Deleting the following element ids from the central database: From 16000001 to 16018396. Removing element and analyzed data after 1012363195 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (769 rows) (769 rows) (0 rows) (447 rows) (439 rows) Deleting the following element ids from the central database: From 22000001 to 22001479. Removing element and analyzed data after 1012363318 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1479 rows) (1479 rows) (0 rows) (153 rows) (1479 rows) Deleting the following element ids from the central database: From 44000001 to 44000809. Removing element and analyzed data after 1012363302 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (763 rows) (763 rows) (0 rows) (742 rows) (763 rows) Checking for duplicate element names and inserting elements ... Adding remote element association, element alias and latency data ... Adding element data from files lat_b41/nea_b23/els_b45/mtf_b45 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. E_CO003F COPY: Warning: 24 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 19 rows successfully copied. (19 rows) E_CO003F COPY: Warning: 1552 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 5053 rows successfully copied. (5053 rows) E_CO003F COPY: Warning: 1378 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 11723 rows successfully copied. (11723 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (119 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (315 rows) Logging elements deleted at the remote sites ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. 
(0 rows) (0 rows) (0 rows) No cleanup of poller configuration file required. Updating servers with changes ... Creating database indexes on table nh_stats0_1012348799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012352399 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012355999 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012359599 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012363199 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012366799 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012370399 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Creating database indexes on table nh_stats0_1012373999 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. 
(1 row) (0 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1 row) Cleaning up merge files Done merging database nethealth. ### Done Wed Jan 30 07:51:22 EST 2002 ----- Scheduled Job ended at '30/1/2002 07:51:26'. ----- 2/28/2002 7:42:21 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Thursday, February 28, 2002 7:32 AM To: Zhang, Yulun Cc: O'Connor, Farrell Subject: problem ticket 20884 Yulun, Can I get an update on the status of Problem ticket 20884? 2/28/2002 11:44:58 AM yzhang Farrell, Here is how nhFetchDb works for the nh_elem_assoc table (one of the tables showing the duplicate message). First, nhFetchDb runs: delete from nh_elem_assoc where element_id >= ${MIN_ID} and element_id <= ${MAX_ID}; the min_id is the min element_id (in the remote server's id range) in the remote nh_element table, and the max_id is the max element_id (in the remote server's id range) in the remote nh_element table. Then nhFetchDb inserts the remote element association table data into the central through: copy table nh_elem_assoc (${acols}) from '${AppendDir}/${ELEMENT_ASSO. Thus the possible cause of the duplicate is that the min and max element_ids don't match between the nh_element table and the assoc table; this is something you can explain to other customers who require the explanation. For this customer, please get the following for the assoc table for one remote (assume the server_id for this remote is 2) and the central: 1) echo "select element_id from nh_element where element_id between 2000000 and 3000000\g" | sql nethealth > elem_id_elem_table.out 2) echo "select min(element_id), max(element_id) from nh_elem_assoc\g" | sql nethealth > elem_id_elem_assco.out hope you can do some data analysis before you pass the data to me. 
Thanks Yulun 4/10/2002 7:12:21 AM foconnor -----Original Message----- From: Loscher Hermann [mailto:Hermann.Loscher@fth2.siemens.de] Sent: Wednesday, April 10, 2002 4:05 AM To: 'O'Connor, Farrell' Subject: AW: Call ticket 58424 Hello, sorry that you had to wait for my answer. We have upgradet to 4.8 D7 Patchlevel 8 but we still have the problem with duplicate keys. Attached you have the logfile of the last fetch. Regards Hermann Loscher Siemens Business Services GmbH & Co. OHG 4/10/2002 7:14:37 AM foconnor From 11000001 to 11101854. Removing element and analyzed data after 1018335722 INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (1304 rows) (1304 rows) (4 rows) (799 rows) (1304 rows) Checking for duplicate element names and inserting elements ... Adding remote element association, element alias and latency data ... Adding element data from files lat_b41/nea_b23/els_b45/mtf_b45 ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. E_CO003F COPY: Warning: 24 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 19 rows successfully copied. (19 rows) E_CO003F COPY: Warning: 1552 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 6730 rows successfully copied. (6730 rows) E_CO003F COPY: Warning: 1378 rows not copied because duplicate key detected. E_CO0028 COPY: Warning: Copy completed with 1 warnings. 13808 rows successfully copied. (13808 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (131 rows) INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. (327 rows) Logging elements deleted at the remote sites ... INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. 
(0 rows) (0 rows) 4/19/2002 6:42:01 PM yzhang Farrell, Here is something I wrote to you last time, They see the same error after upgrade, but you still can use the following hint to investigate the problem. Here is how nhFetchDb works for nh_elem_assoc table (one of the table showing duplicate message), first nhFetchDb delete from nh_elem_assoc where element_id >= ${MIN_ID} and element_id <= ${MAX_ID}; the min_id is min element_id (on the remote server id range) on the remote nh_element table the max_id is max element_id (on the remote server id range) on the remote nh_element table then nhFetchDb Insert remote elements association table data into central through copy table nh_elem_assoc (${acols}) from '${AppendDir}/${ELEMENT_ASSO. Thus the possible cause of the duplicate is that the min and max element_ids don't match between nh_element table and assoc table on the central site for a specific remote poller. The another cause is that the fetch fails after inserting nh_element table on the central, but before inserting nh_elem_assoc table, you may want to check with customer regarding their fetch for this customer, please get the following for assoc table for one remote (assume the server_id for this remote is 2) and central. : from 1) echo "select element_id from nh_element where element_id between 2000000 and 3000000\g" | sql nethealth > elem_id_elem_table.out 2) echo "select min(element_id), max(element_id) from nh_elem_assoc\g" | sql nethealth > elem_id_elem_assco.out Thanks Yulun 4/22/2002 9:48:32 AM mmcnally Requested more info. 5/3/2002 3:29:41 PM mmcnally Yulun, The requested info is on BAFS/58000/58424/may3. 5/19/2002 12:21:36 PM yzhang please get a script called select_fetch_dup_20884.sh from ~yzhang/scripts on system sulfur customer needs to run the script just by typing the script name, and after finishing the script send the extra_elem.out file located on $NH_HOME/tmp. 
Note: the customer needs to run this script on each remote and on the central machine; the output file extra_elem.out should be renamed to reflect the host name. Mike, you need to do a quick test on NT for nh47 before sending it to the customer, and copy me on the email you will send to the customer. Thanks Yulun
5/20/2002 10:10:45 AM cestep Sent the script to the customer.
5/21/2002 6:43:15 PM yzhang Colin, don't wait for the information; please have the customer run the following script to make all the elements match between the tables. Type the script name to run it. After completion, look for a file called elemMatch.out in the directory where they ran the script and make sure there is no error in that file. Run this script on each remote site and on the central site, and stop nhServer before running it. After the script completes, start nhServer and resume the regular remote save and fetch. Because this customer is on 471, it is not patchable. I will incorporate the script into nhRemoteSaveDb and patch it for 48 and 50. Thanks Yulun
5/22/2002 10:00:12 AM cestep Sent the request to the customer.
5/24/2002 6:31:19 PM yzhang The duplicate error appears when fetching a database from a remote site where some elements are in nh_elem_assoc but not in the element_core table. We don't know exactly why this mismatch occurs. The code change here is to emit a diagnostic message when the mismatch occurs.
9/20/2002 11:23:09 AM rsanginario Guys, I just ran this test. After deleting rows and saving the Db I did get the Warning message. However, after matching the elements back up and doing a RemoteSaveDb the Warning message still appeared. 
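Yulun's explanation of the duplicate-key warnings earlier in this ticket can be sketched in a few lines. This is an illustrative model with invented ids, not actual nhFetchDb code: the pre-fetch delete on the central only clears nh_elem_assoc rows whose element_id falls inside the [min, max] window of the remote's nh_element table, so assoc rows outside that window survive and collide when the copy re-inserts them.

```python
# Illustrative model of the nhFetchDb duplicate-key scenario (ids invented).
# The central's delete window is bounded by the min/max element_id found in
# the remote's nh_element table; assoc rows outside it are never cleared.

remote_nh_element = {2000010, 2000020, 2000030}    # element ids on the remote
central_assoc_rows = {2000005, 2000010, 2000020}   # assoc rows already on the central
remote_assoc_rows = {2000005, 2000010, 2000020}    # remote assoc rows to be copied in

lo, hi = min(remote_nh_element), max(remote_nh_element)
after_delete = {e for e in central_assoc_rows if not (lo <= e <= hi)}

# Rows that survive the delete AND arrive again in the copy -> duplicate key.
duplicates = after_delete & remote_assoc_rows
print(sorted(duplicates))  # prints [2000005]
```

The id 2000005 sits below the remote's nh_element range (the table mismatch Yulun describes), so it is skipped by the delete and duplicated by the copy; this matches the E_CO003F warnings in the fetch log.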
9/20/2002 4:36:06 PM rsanginario Yulun suggested doing the following as part of the test plan: Randy, here is what you mentioned: after matching elements between the tables (via deleting all parent rows affected by the mismatch) and then running remotesavedb again. How did you delete all parent rows affected by the mismatch? Try the following, then run remoteSave again: delete from nh_elem_assoc where element_id not in (select element_id from nh_element_core); delete from nh_elem_assoc where parent_id not in (select element_id from nh_element_core); delete from nh_elem_alias where element_id not in (select element_id from nh_element_core); delete from nh_elem_latency where element_id not in (select element_id from nh_element_core); delete from nh_element_aux where element_id not in (select element_id from nh_element_core)\g Yulun I did this and the warning message went away. I will pass this test.
2/6/2002 3:46:18 PM jnormandin The install is currently down. - Running 5.02 - Unsure of when the last db save was - Had a previous ticket ( 58051 ) where Ingres was corrupted due to system backups running while Ingres was running. This was associated with problem ticket # 20102 and closed as No-bug due to the back-up strategy being used. - He disagrees, saying they never concluded that was the cause. - No tape backups being done on this box, but there are scheduled database saves - They are backing that save up via tape, but not the active transaction log ( may have been in the past, but not currently ) - STEPS TAKEN ps -ef | grep ii - iigcn - dmfrcp, acp - iidbms - sql iidbdb - could not open - rollforwarddb iidbdb - Error initializing - Error below. -------------------------------------------------------------------------------------------------------------------------------------------------------- C:\>rollforwarddb iidbdb Wed Feb 06 12:07:22 2002 E_DM9411_DMF_INIT Error initializing the DMF working environment. 
Wed Feb 06 12:07:22 2002 E_DM1051_JSP_NO_INSTALL The installation is not currently running. Run INGSTART then try again. Wed Feb 06 12:07:22 2002 E_DM1001_JSP_INIT An error occurred initializing the journal support program. Make sure the program is installed and/or you are running with proper privilege. ------------------------------------------------------------------------------------------------------------------------------------------------------- - ingstop -kill - Unable to remove shared memory location. - ipcs - nothing allocated - let's kill off the Ingres processes manually - done. - verified no Ingres processes running. - ipcclean - ok. - blow away the current trans log and recreate a blank file - ok - ingstart - would not start - II_SYSTEM is not set error message - ingprenv - environment looks good. - nhResizeIngresLog 500 - starting Ingres server - Still trying to start; have any processes come up? - nhiNtsCm start ingres_database - iigcn , using no cpu - CPU is at about 0. - Can we reboot the box? - yes - Rebooting box - Ok rebooted. - Are there any Ingres processes running? - Yes.. all 4 processes. - sql iidbdb - could not open the iidbdb database - nhResizeIngresLog 550 - log has been created. - Starting Ingres servers - finished successfully. - sql iidbdb - same problem - sql ehealth - failed - sql nethealth - failed - rollforwarddb iidbdb - Same error as above. 
2/6/2002 4:15:56 PM yzhang requested the verifydb on system catalog, and sysmod output
2/6/2002 4:17:23 PM yzhang requested verifydb on system catalog and sysmod output
2/7/2002 2:28:43 PM yzhang Call customer, do the following: 1) sysmod iidbdb > sys_iidbdb.out 2) nhForceDb ehealth > force.out
2/7/2002 3:17:20 PM yzhang issue was created with CA regarding the inconsistent iidbdb
2/7/2002 4:59:40 PM yzhang Jason, I created an issue with CA; they need the infodb iidbdb output, but this customer needs to do the following to get infodb: 1) stop Ingres from the service 2) cbf (rm the config.lck file from ingres/files if this file locks cbf) 3) from the CBF window, if the Remote Command count is 1 or any other value, reset it to 0 and save it 4) start Ingres from the service 5) infodb iidbdb > infodb_iidbdb.out (send this file to me) Yulun
2/8/2002 9:44:18 AM yzhang Jason, Here is the request for this problem. Shut down Ingres. Once Ingres is shut down, make sure that the remote command server startup count is set to zero in cbf (the remote command server makes a connection to the iidbdb when Ingres starts up, so I want to disable it for now). If it's set to 1 or higher in CBF (check the 'Startup Count' column in the main screen for CBF) then select 'Edit Count' from the menu at the bottom and set it to zero. Exit CBF. Restart Ingres. Once Ingres is started, attempt to run infodb again. Send me the errlog.log, iircp.log, iiacp.log, symbol.tbl, config.dat, and the lic98.log files. The files are all in the files directory, except for lic98.log which is in c:\ca_lic.
2/8/2002 1:44:52 PM jnormandin Yulun, The Startup count was already set to 0. All of the files requested have been placed on bafs. Please let me know the next step as soon as possible. The customer is anxious to get up and polling asap. He does have a good backup from Feb 4th, but does not want to restore everything unless we know what originally caused the issue. Thanks! 
Jason
2/8/2002 4:28:39 PM jnormandin From: Normandin, Jason Sent: Friday, February 08, 2002 3:55 PM To: Zhang, Yulun Subject: 20885 Importance: High Yulun, Have you had an opportunity to work on this issue this afternoon? -Jason
2/8/2002 5:57:15 PM yzhang Jason, Sorry to ask you for information again: Can you try the following, and send all the output files: 1) rollforwarddb iidbdb > rollforwarddb.out 2) ls -l $II_CHECKPOINT/ingres/jnl/default/iidbdb > jnl.out 3) ls -l $II_JOURNAL/ingres/ckp/default/iidbdb > ckp.out 4) sql iidbdb > sql_iidbdb.out Thanks Yulun
2/9/2002 1:41:28 PM yzhang here are the outputs of rollforwarddb and sql iidbdb. Have not got the listings of II_CHECKPOINT and II_JOURNAL yet. Now the rollforwarddb succeeded, but our client still cannot do sql iidbdb; any suggestion? Thanks Yulun
2/11/2002 11:09:24 AM jnormandin Changing status to assigned from more info as info has been provided
2/11/2002 1:36:17 PM yzhang Jason, He can access the iidbdb and ehealth db now, after forcing ehealth and rollforwarddb iidbdb. What he needs to do is save the current db, then do our normal db recycle. Can you write a procedure for him? At the same time, I am working with CA trying to find out the root cause
2/11/2002 4:57:41 PM jnormandin - Customer has loaded the DB and is now polling successfully. - We need to determine why this has occurred so that it can be avoided in the future
2/12/2002 12:04:43 PM yzhang Talked to CA; they think the problem is that our client uses terminal service to access the database, which causes the lock quota to be exceeded. They said this is a known problem for Windows 2000. What they need to do to prevent the same problem from happening is to not access the database through terminal service. Terminal service is software installed by default on Windows 2000; they can disable the terminal service. Talk to the customer on this; they may know this better. 
the other thing is: can you write up a summary of this problem, including the detail about how we took care of the iidbdb inconsistency issue. This information should go to the dbworksheet; I want to see your write-up before putting it there. Thanks Yulun
2/13/2002 3:00:53 PM jnormandin The customer was using the terminal service to access the Db - They will no longer use this service. - Closing call ticket and bug
2/7/2002 10:03:19 AM mfintonis submitted by Don Gray (Support) while onsite at Earthlink. Earthlink contact is Cody West: cody.west@corp.earthlink.net (626) 296-5783
2/11/2002 11:41:07 AM rhawkes Putting in Moreinfo so QA can retest it.
2/11/2002 3:32:24 PM Betaprogram assigned to QA for retest
2/12/2002 9:07:01 AM jdorden Testing on the oracle55_ndsol kit of Feb 11, front end timor, the Database Status window opens fine. Testing on Front End Kiska (NT Server), Feb 1 Kit: Open the window - spikes CPU to 40% - settles down quickly to 2-3% Scroll Location Name - spikes to 22% Tab Clicks spike to 7-8% Refresh spikes to 20% Similar results on Pitcairn - WIN 2000 Server For a simple informational window the above spikes in CPU usage are excessive. Closing the window releases the CPU immediately. Repeated frequent Open/Close of the window does not use and hold CPU. I am leaving the window open for a time to see what happens.
2/12/2002 10:51:40 AM jdorden Tested again with the Trusted Flag for the FE = no - first test did not have this set. Test results the same on FE pitcairn Win2k Server.
2/13/2002 8:57:50 AM rbonneau Re-assigning back to Rich as retest passed.
2/13/2002 11:10:26 AM rhawkes Unable to duplicate problem in-house.
12/12/2002 12:05:39 PM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Tuesday, February 12, 2002 12:49 AM To: Burke, Walter; 'Zhang, Yulun' Subject: RE: Ticket # 59748 - Groups Created on RemotePoller in 4.8 are not present in 5.0.2 All, We have another problem with the group. 
If we create a user to access only certain groups, for example user ispmail to access only: group-list = IPM-ispmail-Product groups = IPM-ispmail-router-all , ..... On the web interface when the user logs in as ispmail, under the organisation tab, under group-list and group, the user can see the ispmail groups. However, under elements, the user can see all the elements at the Central site instead of only those that are in the ispmail group. This does not happen under run reports, where we cannot see all elements. This happens to all other users at the Central site.
2/12/2002 12:09:09 PM wburke This was the original problem: ----- Job started by User at '12/02/2002 02:01:54 AM'. ----- ----- $NH_HOME/bin/nhReport -scheduled -rptType service -rptName BusinessUnit -subjType groupList -elemType multiTech -subjName IPM-ispmail-Product -uiNamesType names -autoRange prev4Weeks -protocols all -namesType names -chartType line -chartOpts standard -gran day -web $(SUBJECT)_$(DATE)_$(TIME) -outputDir /opt/Nethealth/output/$(_reportType)/ -pdf $(SUBJECT)_$(DATE)_$(TIME).pdf -after "nhMail Vinesh.Latchman@team.telstra.com" -jobId 1000362 -jobCount 5 ----- Error: Invalid group file 'Could not get modified time from nh_group_list table'. Report failed. select name from nh_group_list shows 'ipm-ispmail-product' - all lower case. Updated the table to show IPM-ispmail-Product, and now it appears to work, with the additional error: The scheduled report ran OKAY. However, in the log file there was a confusing error message: "pitt7-ispmail-r01-RH is ignored in report because pitt-webhost-r01-RH is not in group IPM-ispmail-Routers-All." I checked: "pitt7-ispmail-r01-RH" is in the group "IPM-ispmail-Routers-All".
2/12/2002 12:16:05 PM yzhang let's consider the basics, that is, did the fetch transfer the group; if not, do a fetch in debug and send the errlog.
2/12/2002 6:09:46 PM wburke Error: No elements in 'DIPC-qld-CustomerPVCs' apply to this report. 
Error: No elements in 'DIPC-qld-CustomerPVCs' apply to this report.
2/12/2002 7:01:17 PM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Tuesday, February 12, 2002 6:35 PM To: 'Burke, Walter' Cc: Zhang, Yulun Subject: RE: Telstra which NEED to be addressed TODAY. All the groups were created on the remotes and then fetched into Central. After nhDeleteElements at Central, they had to be deleted at the Central, then fetched again into Central from the remotes. This is when it became unstable: 1). Scheduled health reports fail; the health reports that were scheduled in the earlier 4.8 release are identified as "health" and these reports are working properly. The new health reports that we now schedule under 5.02 are identified as "health report" and these are failing. 2). Scheduled service reports were failing and we had to update group_list_id. 3). On the web interface, under the organisation tab, we see all elements under elements instead of only those that are in the groups above. We are in the situation where some scheduled reports fail because the group_list_id needs to be updated. And the new scheduled health reports are failing, maybe because the group_list_id needs to be updated or things have changed under 5.02. When we update the group_list_id, users can no longer see this group until we re-enter the group-list under User Administration. That is, updating group_list_id causes the particular group-list to fall out of the list of group-lists the users can view. And the reports need to be re-scheduled as the earlier group-list has disappeared. We learnt all this through fixing a Service-Level report that was not working yesterday. Vinesh Latchman Project Manager,
2/13/2002 8:52:00 AM wburke -----Original Message----- From: Latchman, Vinesh To: 'Zhang, Yulun '; 'Burke, Walter ' Sent: 2/13/02 5:31 PM Subject: RE: Telstra which NEED to be addressed TODAY. 
Yulun, Walter, We have some issues regarding groups and group-lists that are affecting reports being scheduled and also nhExportData. With nhExportData, one of the jobs is failing: Invalid group file 'dip_open-RouterSwitch-All'. The scheduled Health reports are failing because: Error: No elements in 'DIPC-wa-CustomerPVCs' apply to this report. Error: No elements in 'DIPC-wa-CustomerLinks' apply to this report. Error: No elements in 'DIPC-sa_nt-CustomerLinks' apply to this report. Error: No elements in 'DIPC-sa_nt-CustomerPVCs' apply to this report. Error: No elements in 'DIPC-qld-CustomerLinks' apply to this report. Error: No elements in 'DIPC-qld-CustomerPVCs' apply to this report. From the data I had emailed, you can confirm whether they were created at remote/central and whether they have valid ids and element_ids. This is our critical issue as we cannot exportData and run reports because of the problems with the group. Please send commands/queries that I need to run to check if the groups are being properly created/fetched at the Central. From the console view (Edit/Groups), it all looks okay. Which is why I have not yet deleted and created these groups.
2/13/2002 8:53:39 AM wburke -----Original Message----- From: Zhang, Yulun Sent: Tuesday, February 12, 2002 9:28 PM To: 'Latchman, Vinesh' Cc: Burke, Walter Subject: RE: Telstra which NEED to be addressed TODAY. this is the problem output from the central site group table; the correct state is that the group_id range should be the same as the corresponding element_id ranges which make up the group. 
you have several 1-million-range groups that contain 5-million-range element_ids; see more in the following data
|group_id |group_machine|element_id |element_machi|
| 1000373| 0| 5000041| 0|
| 1000373| 0| 5000040| 0|
| 1000373| 0| 5000039| 0|
| 1000373| 0| 5000038| 0|
| 1000373| 0| 5000037| 0|
| 1000373| 0| 5000033| 0|
| 1000373| 0| 5000031| 0|
| 1000373| 0| 2002194| 0|
| 1000373| 0| 2002032| 0|
| 1000373| 0| 2002029| 0|
2/13/2002 3:34:59 PM yzhang use this procedure for truncating the nh_deleted_element_core table: 1) echo "copy table nh_deleted_element_core() into 'nh_deleted_element_core.dat'\g" | sql $NH_RDBMS_NAME 2) echo "modify nh_deleted_element_core to truncated\g" | sql $NH_RDBMS_NAME 3) echo "modify nh_deleted_element_core to heap\g" | sql $NH_RDBMS_NAME
2/14/2002 10:30:24 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Wednesday, February 13, 2002 9:20 PM To: 'Zhang, Yulun '; 'Burke, Walter ' Subject: RE: Telstra which NEED to be addressed TODAY. All, The problem in the email has repeated today. This is affecting scheduled reports. Any plans Vinesh Latchman Project Manager, Hosting & Internet, IS, Telstra Ph: (03) 9634 6294 > -----Original Message----- > From: Latchman, Vinesh > Sent: Wednesday, 13 February 2002 5:31 pm > To: 'Zhang, Yulun '; 'Burke, Walter ' > Subject: RE: Telstra which NEED to be addressed TODAY. > > Yulun, Walter, > > We have some issues regarding groups and group-lists that are affecting > reports being scheduled and also nhExportData. > > With nhExportData, one of the jobs is failing : > Invalid group file 'dip_open-RouterSwitch-All'. > > The scheduled Health reports are failing because : > > Error: No elements in 'DIPC-wa-CustomerPVCs' apply to this report. > Error: No elements in 'DIPC-wa-CustomerLinks' apply to this report. > Error: No elements in 'DIPC-sa_nt-CustomerLinks' apply to this report. > Error: No elements in 'DIPC-sa_nt-CustomerPVCs' apply to this report. 
> Error: No elements in 'DIPC-qld-CustomerLinks' apply to this report. > Error: No elements in 'DIPC-qld-CustomerPVCs' apply to this report. > > From the data I had emailed, you can confirm whether created at > remote/central and wether it has valid id and element_ids. > > This is our critical issue as we cannot exportData and run reports because > of the problems with the group. > Please send commands/queries that I need to run to check if the groups are > being properly created/fetched at the Central. From console view > (Edit/Groups), it all looks okay. Which is why I have not yet deleted and > created these groups. 2/14/2002 10:30:42 AM wburke Still having problems. Back to assigned. 2/15/2002 9:59:30 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Thursday, February 14, 2002 9:36 PM To: ''Zhang, Yulun ' ' '; '''Burke, Walter ' ' ' Subject: nhFetch Group problems All, We are having the following problems every for last 3 days : Error: No elements in 'DIPC-wa-CustomerPVCs' apply to this report. Error: No elements in 'DIPC-wa-CustomerLinks' apply to this report. Error: No elements in 'DIPC-sa_nt-CustomerLinks' apply to this report. Error: No elements in 'DIPC-sa_nt-CustomerPVCs' apply to this report. Error: No elements in 'DIPC-qld-CustomerLinks' apply to this report. Error: No elements in 'DIPC-qld-CustomerPVCs' apply to this report. Through the console, I can see that the elements are in the group created at the remote. And on the console at Central , I can see elements in these group. I can go into the Central over the web interface and able to run a health report. See the attached file. However when I do the same by scheduling the same report, it gives the above error message. Can we progress this issue further as cannot be manually generating the reports, it is very time-consuming. Vinesh. 
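The group-table symptom Yulun pointed out above (1-million-range group ids holding 5-million-range element_ids) amounts to a simple band check. A hedged sketch, assuming, as the ticket suggests, that each distributed server owns a one-million id band and a group and its member elements should share a band:

```python
# Sketch of the id-band consistency rule from Yulun's email (assumption:
# group_id and member element_ids must fall in the same per-server id band).
BAND = 1_000_000

def same_band(group_id: int, element_id: int) -> bool:
    # Two ids are consistent when they land in the same one-million band.
    return group_id // BAND == element_id // BAND

# Rows quoted from the central's group table dump above.
rows = [(1000373, 5000041), (1000373, 5000040), (1000373, 2002194)]
violations = [(g, e) for g, e in rows if not same_band(g, e)]
print(len(violations))  # prints 3
```

All three sample rows violate the rule: a band-1 group_id points at band-5 and band-2 element_ids, which is the mismatch the dump was meant to demonstrate.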
2/19/2002 10:43:31 AM mwickham -----Original Message----- From: Burke, Walter Sent: Friday, February 15, 2002 05:27 PM To: ts_esc_leads Subject: Escalate # 21040 - Groups fail. - Cust. Sensitivity 2/19/2002 1:49:42 PM yzhang Walter and Vinesh DIPC-wa-CustomerLinks,DIPC-wa-CustomerLinks,DIPC-sa_nt-CustomerLinks are the group names, right? Vinesh, Can you save the database from central machine, and place the database tar file on the ftp site. then let us know. and get the nhCollectCustData, place the collect.tar Thanks Yulun 2/20/2002 12:21:40 PM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, February 20, 2002 12:11 PM To: Zhang, Yulun Subject: RE: 20140 loading now on nevada.concord.com login vegas pass vegas root -Watw!! I will let you know when finished. 2/20/2002 6:33:46 PM wburke Loaded on nevada. see above. 2/22/2002 6:21:21 PM yzhang Walter has sent the following to customer, we believe this will solve the problem Vinesh, Modify the following jobs to run as Lan/Wan not as Lan inteface types in Console -> setup -> scheduled jobs -> modify Health Reports 1000366 - 1000372 We are having the following problems every for last 3 days : Error: No elements in 'DIPC-wa-CustomerPVCs' apply to this report. Error: No elements in 'DIPC-wa-CustomerLinks' apply to this report. Error: No elements in 'DIPC-sa_nt-CustomerLinks' apply to this report. Error: No elements in 'DIPC-sa_nt-CustomerPVCs' apply to this report. Error: No elements in 'DIPC-qld-CustomerLinks' apply to this report. Error: No elements in 'DIPC-qld-CustomerPVCs' apply to this report. Walter 2/25/2002 8:03:21 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Sunday, February 24, 2002 7:03 PM To: 'Burke, Walter' Cc: Zhang, Yulun Subject: RE: Ticket # 60033, PT# 21040 - Group Report issue Walter, I made the changes and rescheduled one of the reports to run now. There are no errors. The reports run successfully. 
It seems the error message below means incorrect technology type. Thanks Vinesh Latchman
2/25/2002 9:30:36 AM yzhang Customer made the changes and rescheduled one of the reports to run now. There are no errors. The reports run successfully. It seems the error message below means incorrect technology type. ticket closed
2/13/2002 8:28:00 AM wburke nhFetch takes 4+ hours to finish. -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Wednesday, February 13, 2002 1:40 AM To: 'Zhang, Yulun '; 'Burke, Walter ' Subject: RE: Telstra which NEED to be addressed TODAY. Yulun, Walter, The fetch is taking too long and while the fetch is scheduled or running, all other scheduled jobs are delayed until the fetch finishes. This affects scheduled reports and nhExportData, which appear 5 to 6 hours later than expected. At the central, we run the fetch every 12 hours at 8:00 AM/PM. The morning fetch delays the export of yesterday's data and hogs up 50% of the CPU. This affects users accessing/running reports. The afternoon fetch affects db saves, DataAnalysis, and LiveHealth Analysis jobs that are scheduled after the fetch. The fetch also delays the scheduled reports that run at 1:00 AM every day. The fetch process that is taking too much time is : awk \ BEGIN {save=1} FILENAME == > "deleted.names" {delist {"\""$1"\"" ... Note: the nh_deleted_element_core table may not be clearing. - last check shows 60000+ rows after the fetch completed.
2/14/2002 10:11:02 AM wburke -----Original Message----- From: Latchman, Vinesh [mailto:Vinesh.Latchman@team.telstra.com] Sent: Thursday, February 14, 2002 12:58 AM To: 'Burke, Walter' Subject: RE: Ticket # 59770 - nhFetch > Walter, > > I ran the commands and there were no error messages. > <> > However the count on the nh_delete_element_core table shows 60,996 elements. > > I tried nhFetchDb on one remote, however I doubt it will make any > difference. 
> > -----Original Message----- > From: Latchman, Vinesh > To: 'Burke, Walter' > Sent: 2/14/02 2:18 PM > Subject: RE: Ticket # 59770 - nhFetch > > <>
2/19/2002 2:23:10 PM yzhang waiting for database and nhCollect.tar file
2/19/2002 10:27:49 PM yzhang the fetch is ok now; this one will be closed by tomorrow if there are no complaints about the fetch
2/20/2002 12:28:12 PM wburke From: Zhang, Yulun Sent: Tuesday, February 19, 2002 10:20 PM To: Burke, Walter Subject: RE: Telstra issues with 5.02 walter, thanks for this summary. now the fetch is running fine (no hangs) We can close.
2/21/2002 4:22:48 PM wburke OK to close.
2/22/2002 2:05:23 PM yzhang closed
2/13/2002 11:32:50 AM cgould Customer is creating a custom mtf file to poll gauge variables from a router. The variable OIDs are as follows: cvpdnSystemTunnelTotal 1.3.6.1.4.1.9.10.24.1.1.4.1.2 cvpdnSystemSessionTotal 1.3.6.1.4.1.9.10.24.1.1.4.1.3 These OIDs in the MIB dump I have are both gauge values, and are remaining constant at 2 and 5, respectively. We built an mtf, and this is how we defined them: variable1 = cvpdnSystemTunnelTotal% * (deltaTime / 100) variable2 = cvpdnSystemSessionTotal% * (deltaTime / 100) variable3 = (1/(cvpdnSystemTunnelTotal%)) * (deltaTime / 100) variable4 = (1/(cvpdnSystemSessionTotal%)) * (deltaTime / 100) Then, in the columnExpression.usr, we have the following calculations (for some odd reason, he sees value in the inverses of these OIDs as well): 1000001|DLL_FRAMES 1000002|(1/(DLL_FRAMES)) 1000003|DLL_BYTES 1000004|(1/(DLL_BYTES)) 1000005|DLL_MCASTS 1000006|DLL_BCASTS The outputs for DLL_FRAMES, DLL_BYTES, DLL_MCASTS and DLL_BCASTS are all what we'd expect to see: 2, 5, .5 and .2. However, the problem seems to be with 1/DLL_FRAMES and 1/DLL_BYTES. 
For some strange reason, as opposed to doing the calculations and outputting .5 and .2, they are actually outputting the following: 1/(DLL_FRAMES*DELTA_TIME*DELTA_TIME) and 1/(DLL_BYTES*DELTA_TIME*DELTA_TIME) Here is some sample output from an nhExportData: "","keg:4801_sp1.n-4801-RH-FastEthernet1/0","keg:4801_sp1.n-4801-RH-FastEthernet1/0",100000000,100000000,"02/08/2002 02:03:27 PM",283,283,2,0.00000624306,5,0.00000249722,0.5,0.2
2/21/2002 10:21:25 AM cgould -----Original Message----- From: Michael Whyatt [mailto:michael.whyatt@uk.logical.com] Sent: Thursday, February 21, 2002 10:05 AM To: support@concord.com Subject: CallT0000059546 Hello Support, Currently this call is open pending fix. Our customer has asked us to enquire whether there have been any developments regarding this fix. At this point he would be quite happy to know that you have diagnosed the issue even if you do not know when a fix may be available. Currently they may be able to proceed with their project with the supplied workaround; however, they are concerned that they may encounter further problems with the calculations when it goes live. Regards Mike
2/21/2002 10:21:46 AM cgould -----Original Message----- From: Gould, Chad Sent: Thursday, February 21, 2002 10:11 AM To: Martin, Jeff Subject: FW: CallT0000059546 Jeff- Do you have any idea when you'll have a chance to take a look at this?
2/25/2002 1:12:20 PM cgould -----Original Message----- From: John Farebrother [mailto:jfarebrother@concord.com] Sent: Monday, February 25, 2002 11:59 AM To: Rob McCabe; Chad Gould Cc: awaterhouse@concord.com Subject: Call 59546 Importance: High Chad / Rob Pending ticket 59456 - bug/fix has reached a major issue within British Telecom. Roll-out of the project within Exact has stopped as of this morning. The roll-out of the project is worth about $400k over the next three quarters. 
They are aware that it has a low priority because a workaround exists; however, they are not willing to deploy live until they have the true fix. Obviously we can't just knock this up overnight, so in lieu of this, I need a definitive date to be able to go back to them with. I am aware that it shows as '6.0 pending'; but in light of the stopped project and the financial penalty, can we please bring this forward? Thanks John
2/25/2002 3:02:15 PM jlennox We have asked for some files so that we can evaluate this: all .usr files, the mtf. This looks like a user error rather than a custom variable error. We don't write the columnExpression.usr file. Once we get the files we will try to figure out what they did wrong. Jim
2/26/2002 11:43:47 AM cgould -----Original Message----- From: Gould, Chad Sent: Tuesday, February 26, 2002 11:33 AM To: Lennox, James Subject: ProbT0000021079 James- I couldn't help but notice that this bug is still in MoreInfo status. Did you receive my e-mail yesterday?
2/26/2002 5:51:54 PM rdiebboll Jim Lennox and I looked at the .usr files and saw that they are creating a custom element type and column expressions. This has nothing to do with the eHDP Custom Variable utility. So, we think that the customer must be getting the math wrong in the expressions. We did not see the MTF, so this is an assumption on our part since we couldn't see how the variables in the MTF are mapped to the ones in columnExpression.usr.
2/26/2002 7:43:10 PM cgould -----Original Message----- From: Gould, Chad Sent: Tuesday, February 26, 2002 7:33 PM To: Diebboll, Rob Subject: ProbT0000021079 Rob- The customer is not getting the math wrong in the expressions. I would like to note, once again, that when I simulate their mib dump, I get the exact same results they are seeing. I'd be more than happy to show you. I gave Jim the customer's mtf file, and the excerpt from the one I used to recreate the problem in house is at the top of the problem ticket log. 
What they are trying to do is very simple, but the results are unexpected. All they are trying to output are the values polled from the following OIDs, and their inverses: cvpdnSystemTunnelTotal 1.3.6.1.4.1.9.10.24.1.1.4.1.2 cvpdnSystemSessionTotal 1.3.6.1.4.1.9.10.24.1.1.4.1.3 You'll see near the top of the problem ticket log, I included the excerpt from their (and my) columnExpression.usr file: 1000001|DLL_FRAMES 1000002|(1/(DLL_FRAMES)) 1000003|DLL_BYTES 1000004|(1/(DLL_BYTES)) The first and the third ones seem to work fine; they output the values as expected. However, the second and the fourth ones do not seem to work correctly. When DLL_FRAMES (variable1) outputs a value of 2, 1/DLL_FRAMES does not output 0.5, but rather 0.00000624306, which ironically is 1/(2*DELTATIME*DELTATIME). The sample output, which is in the problem ticket log, of an nhExportData run performed on my simulation shows this. The inverse of DLL_BYTES exhibits this same behavior. 2/27/2002 10:45:06 AM dbrooks see above 2/27/2002 1:41:20 PM jay I spoke with Rob on the phone about this one. Here is how the thing works. When the user defines a gauge in the MTF they MUST apply a factor of deltaTime to the formula. It is expressed as deltaTime/100 because the concept of deltaTime from the MTF standpoint is in centiseconds. That raw value * deltaTime now resides in a column (DLL_FRAMES). If you trend on any formula we ALWAYS divide by DELTA_TIME. So, that end expression is cvpdnSystemTunnelTotal% * (deltaTimeCentiSecs / 100) / DELTA_TIME_SECS We always divide by deltaTime either to convert a counter to a rate or to denormalize a gauge. This works fine if the expression only includes the gauge. If you include the gauge in a formula (1/gauge or gauge1/gauge2) you need to be cognizant of this implicit division and offset it with another DELTA_TIME. 
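The implicit-division explanation matches the numbers in the sample export above exactly. A quick numeric check, as a standalone Python sketch (a simplified model with made-up variable names, not eHealth code), using the 283-second sample interval and the polled values 2 and 5 from the export row:

```python
# Simplified model (an assumption based on the engineering notes, not eHealth
# code): a gauge defined in the MTF is stored as raw_value * deltaTime, and
# every trended expression is implicitly divided by DELTA_TIME afterwards.
DELTA_TIME = 283              # sample interval in seconds, from the export row
raw_frames = 2                # polled cvpdnSystemTunnelTotal value

stored_frames = raw_frames * DELTA_TIME   # what the DLL_FRAMES column holds

def report(expression_value):
    """The implicit division applied to every trended expression."""
    return expression_value / DELTA_TIME

# The user's expression 1/(DLL_FRAMES) becomes 1/(raw * dT * dT):
broken = report(1 / stored_frames)
# A compensated expression DELTA_TIME*DELTA_TIME/DLL_FRAMES reduces to 1/raw:
fixed = report(DELTA_TIME * DELTA_TIME / stored_frames)

print(broken)   # ~0.00000624306, the value seen in the export
print(fixed)    # 0.5, the value the customer expected
```

With raw_frames = 2 this reproduces the reported 0.00000624306 = 1/(2·283·283), confirming that the anomaly is the double deltaTime factor and not a math error in the customer's expression.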
1/DLL_FRAMES turns into 1/(cvpdnSystemTunnelTotal% * (deltaTimeCentiSecs / 100)); add the implicit division and you get (1/(cvpdnSystemTunnelTotal% * (deltaTimeCentiSecs / 100)))/DELTA_TIME or 1/((cvpdnSystemTunnelTotal% * (deltaTimeCentiSecs / 100))*DELTA_TIME) which is what the user is seeing. To compensate and flush out both the deltaTime factor added as a result of the MTF and the implicit deltaTime, they would need to change the formula to: 1*DELTA_TIME*DELTA_TIME/DLL_FRAMES 3/20/2002 9:38:25 AM jay This is as designed and the user has to work around this. I think support has two options: 1) An improvement request should be entered to give the user a better notation so they don't have to think about implicit division. or 2) this bug could be turned over to documentation to document a reminder on the ugliness. 3/20/2002 9:38:34 AM jay 3/20/2002 10:04:44 AM jay -----Original Message----- From: Gould, Chad Sent: Wednesday, March 20, 2002 8:35 AM To: Wolf, Jay Subject: FW: ProbT0000021079 has been changed to MoreInfo Hi Jay- I had submitted an enhancement to documentation on this. As far as I and the customer are concerned, it can be closed. By the way, thanks for your help on this thing. It had baffled me and many others, but you were able to give a great description of what was going on. 2/14/2002 8:24:02 AM Betaprogram Jayde Hanley of Empowered Networks for Alcatel Canada jayde.hanley@empowerednetworks.com 613-271-7971 Migration testing from 5.0.2 to 5.5. Successfully installed Oracle and eHealth 5.5. Ran migration scripts successfully. Ran nhCreateDb to create a LARGE database. nhCreateDb failed. Here is the full screen output: # ./nhCreateDb -------------------------------------------------------------------------- --- Distributed Console --------------------------------------- Distributed consoles are used only in an eHealth clustered environment. Distributed consoles do not poll and cannot discover elements. 
For more information, refer to the eHealth Installation Guide. Do you want to install this system as a distributed console? [n] Please select whether you want to install using the small medium large or XLarge model. This choice will determine the set of sizes used to create your tablespaces and tables. Small <= 3,000 elements Medium <= 10,000 elements Large <= 25,000 elements Extra Large > 25,000 elements 1) small 2) Medium 3) LARGE 4) XLARGE Please enter the number of your selection : 3 -------------------------------------------------------------------------- --- Database Directories --------------------------------------- Oracle databases require the creation of a number of tablespaces distributed over several disks. In order to create the database, the install program needs to know which directories to create these tablespaces in. eHealth supports between 1 and 9 directories for tablespaces. Each directory must be in a different device. Enter number of directories to use for tablespaces : 1 Enter directory 1 : /database/oracle/tablespace create_database:startup and create system01.dbf... Startup file is /database/oracle/dbs/initEHEALTH.ora Performing Oracle specific database initialization. unix_ora_installdb.sh:Error: Failed to create NH_INDEX ERROR: An error occurred during the Oracle installation phase. Please consult the log file and rerun this installation. # Here is an error message seen at the bottom of the CreateDb.log file: create_dbfile:Creating NH_INDEX SQL*Plus: Release 8.1.7.0.0 - Production on Wed Feb 13 15:19:37 2002 (c) Copyright 2000 Oracle Corporation. All rights reserved. SQL> Connected. 
SQL> SQL> 2 3 4 5 6 7 8 9 create tablespace NH_INDEX * ERROR at line 1: ORA-01119: error in creating database file `/database/oracle/tablespace/oradata/EHEALTH/nh_indx01.dbf` ORA-27044: unable to write the header block of file SVR4 Error: 27: File too large Additional information: 3 Disconnected from Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production With the Partitioning option JServer Release 8.1.7.0.0 - Production ***** There was plenty of free disk space in the partition where the tablespace was created (/database/oracle/tablespace): # df -k Filesystem kbytes used avail capacity Mounted on /dev/dsk/c0t1d0s0 1637233 940149 647968 60% / /proc 0 0 0 0% /proc fd 0 0 0 0% /dev/fd mnttab 0 0 0 0% /etc/mnttab swap 1411728 0 1411728 0% /var/run swap 1412496 768 1411728 1% /tmp /dev/vx/dsk/ebeta/vol01 60017664 12677944 46969904 22% /database /vol/dev/dsk/c0t6d0/eh55beta4_sol 631456 631456 0 100% /cdrom/eh55beta4_sol 2/14/2002 12:07:30 PM shonaryar We had them check the system for largefile support, and it was not enabled. After they enabled largefile support on the filesystem, they were able to create the database. saeed 2/27/2002 10:20:23 AM Betaprogram Customer Verified 2/14/2002 8:27:39 AM Betaprogram Michael Loewenthal onsite at State Farm (309) 763-4561 - Craig Cies craig.cies.h2o6@statefarm.com While going through Migration and working on a problem with Saeed, we found out we ran out of Disk Space because we had 40GB of Archive Logs. Saeed said this has been fixed in Beta 5. We manually turned archiving off within the Oracle Server Manager. -Submitted by Mike Loewenthal, the Beta AE 2/14/2002 10:38:58 AM shonaryar This is fixed in beta5. I added the code to turn off archiving while running nhLoadConfigInfo.sh saeed 2/14/2002 8:30:22 AM Betaprogram Michael Loewenthal onsite at State Farm (309) 763-4561 - Craig Cies craig.cies.h2o6@statefarm.com While going through Migration, the conversion of Ingres to Oracle failed. 
Working with Saeed, he found what the problem was and had me manually fix it by loading in the tables with customized parameters he created. The problem was due to a "funny" character in the DB. -Submitted by Mike Loewenthal, the Beta AE- 2/14/2002 9:13:28 AM rhawkes This is a repeat of 21059. Saeed provided a fix for this specific customer, but there is a larger issue of how bad data got into the Ingres database, which is the underlying cause of the problem. Robin will investigate that issue. 2/14/2002 11:33:14 AM Betaprogram EQUANT: Getting an error during nhCreateDb 2/14/2002 2:03:28 PM rhawkes The installer had forgotten to set up the SGA parameter. This is documented in the installation guide. Saeed had the installer set this and the installation succeeded. 2/14/2002 6:05:26 PM rrick Issue: When executing nhSaveDb, we get the following error: Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Fatal Internal Error: Unable to execute 'COPY TABLE nh_node_addr_pair () INTO '/nh/db/save/save021302b.tdb/nap_b30'' (E_SC0206 An internal error prevents further processing of this query. Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Wed Feb 13 22:17:11 2002) ). (cdb/DuTable::saveTable) Spoke with Yulun: - He said to bug it - I have requested a verify on the catalogs - I have also asked for a sysmod 2/25/2002 11:58:50 AM yzhang By looking at the errlog, it seems that the nh_node_addr_pair table got corrupted; also, their maintenance job failed. Let's do the following: 1) have them manually run the maintenance job and send the output. 
2) get verifydb in report mode for the system catalog and the nh_node_addr_pair table 3) size of the Ingres transaction log 4) echo "select count (*) from nh_node_addr_pair\g" | sql $NH_RDBMS_NAME > node_pair.out 5) echo "select count (*) from nh_address\g" | sql $NH_RDBMS_NAME > addr_cout.out Thanks Yulun 3/7/2002 12:05:01 AM rrick Spoke with Dennis: - Had a lot of issues trying to find ingres/data/default. - Customer has a very strange setup - Had to lay the system out on paper: /nh/idb/ingres /nh/db/save /nhdb/ingres/data /nhdblog/ingres /nhdbwork/ingres - Reboot server cleaned up memory and auto ran fsck to repair the filesystem - ingstop -kill - ipcclean - ingstart - dropDlg.sh....successful - Rollforwarddb iidbdb - Sysmod nethealth......iiattribute with duplicates - Rollforward iidbdb - nhiDialogRollup...successful - nhSaveDb....successful - Closing ticket 2/15/2002 10:13:53 AM Betaprogram UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734) 737-7202 Unsure of status or errors during reload of Polled Data. Errors follow: NETMAN3> /nethealth/migrate :nhLoadPolledData.sh -p /ehealth_tmp/polleddata Starting at Fri Feb 15 04:39:44 GMT 2002 Checking Oracle processes ... Initializing ... Inserting data types ... Loading utility tables ... idxcol.bad is detected, please check related log file. index.bad is detected, please check related log file. Please check related log files in /nethealth/migrate/schema/ing_ora_schema 2/15/2002 10:25:05 AM dshepard Not sure what this script does. I believe this belongs to the Db group. 2/20/2002 8:56:00 AM shonaryar I have left him three messages in the last 5 days with no response! I gave status about this to Donna Amaral. saeed 2/20/2002 4:50:28 PM shonaryar This happened because the size of a column was too small in the migration's index tables. This is already fixed in beta 5. However I changed the error message in iomutils/script/nhLoadPolledData.sh to give a better indication of the error. 
saeed 2/15/2002 10:16:55 AM Betaprogram UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734) 737-7202 Running the Convert MyHealth step during migration produced multiple core dumps. Error messages follow: NETMAN3>/nethealth5.5/bin : /nethealth5.5/web/webCfg/nhiWebUtil -convertMyHealth Fatal Error: Assertion for `sts` failed, exiting (in file ./MhtGenMyHealthApp.C, line 1233). sh: 9191 Abort(coredump) Fatal Error: Assertion for `sts` failed, exiting (in file ./MhtGenMyHealthApp.C, line 1233). sh: 9204 Abort(coredump) Fatal Error: Assertion for `sts` failed, exiting (in file ./MhtGenMyHealthApp.C, line 1233). sh: 9213 Abort(coredump) Fatal Error: Assertion for `sts` failed, exiting (in file ./MhtGenMyHealthApp.C, line 1233). sh: 9222 Abort(coredump) Fatal Error: Assertion for `sts` failed, exiting (in file ./MhtGenMyHealthApp.C, line 1233). sh: 9231 Abort(coredump) Fatal Error: Assertion for `sts` failed, exiting (in file ./MhtGenMyHealthApp.C, line 1233). sh: 9240 Abort(coredump) 2/15/2002 4:45:47 PM fmali This is a repeat of 20694, which is fixed in Wanda Beta5. 2/15/2002 10:20:35 AM Betaprogram UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734) 737-7202 Problems when attempting to shut down Ingres. Errors follow: NETMAN3> /nethealth/bin :su ingres % nhStopDb Stopping OpenIngres servers... Unable to shutdown ingres normally trying kill. . . 
kill: 16827: permission denied kill: 16818: permission denied kill: 16804: permission denied kill: 16827: permission denied kill: 16818: permission denied kill: 16804: permission denied ..finished successfully. % ps -ef | grep ingres ingres 6579 6277 0 05:03:39 pts/td 0:00 grep ingres ingres 6578 6277 4 05:03:39 pts/td 0:00 ps -ef root 16827 1 0 20:50:38 ? 77:46 /nethealth/idb/ingres/bin/iidbms dbms (default) DB root 16818 1 0 20:50:38 ? 0:07 /nethealth/idb/ingres/bin/dmfacp DB root 16804 1 0 20:50:35 ? 0:11 /nethealth/idb/ingres/bin/iidbms recovery (dmfrcp) DB ingres 6277 4928 0 05:03:16 pts/td 0:00 csh % kill 16827 16827: Not owner % 2/15/2002 4:47:46 PM rhawkes The processes should be running under UID "ingres", but on this system it seems that they were running as "root". 2/20/2002 8:55:39 AM shonaryar I have left him three messages in the last 5 days with no response! I gave status about this to Donna Amaral. saeed 2/20/2002 2:39:09 PM Betaprogram Dave will not be testing this again until Beta 5. He agreed to close this problem. If it occurs again in Beta 5, we'll reopen it. - Donna Amaral 3/11/2002 2:00:22 PM Betaprogram Email from Beta Site 3/8/02 (prior to receiving Beta 5): TICKET #: 21157 Unable to reproduce; close! 2/15/2002 10:37:52 AM Betaprogram CCRD MIS: Peter Skotny (Concord ProServe) pskotny@concord.com 508-486-4081 Hi Saeed, I am testing the eHealth 5.5 migration process for Concord's MIS implementation. I ran into a problem when running "nhLoadConfigInfo.sh". It appears that I have some bad records in the database that cause nhLoadConfigInfo to fail. I checked the logs created for 1636.bad and 782.bad. There are a total of 14 bad records that failed to load. I untarred ConfigInfo.tar and reran nhLoadConfigInfo.sh, which is what the documentation suggests. I got the same error. I have attached all the files that I think you need. These include the .ctl .bad .dat for 782 and 1636. 
I have also attached the output I got from the /ehealth_ingres/ command window as well as the output from the command window for the stand-alone poller. Thanks Peter Skotny **Attachments from email stored in original email from Peter in Outlook public folder: Eng>Beta Test>5.5>Beta Sites (active)> Concord MIS> Bugs** 2/15/2002 4:29:49 PM shonaryar This relates to 21059 saeed 2/19/2002 5:48:50 PM rtrei The bad elements tracked to having an unknown.mtf in the poller.cfg file. This means that they came in in some manner such that the poller.cfg didn't know about them. It looks like 21174 is a similar situation. This is good news: we don't have to worry about massively corrupt customer databases on data they are polling. We do have to figure out how to map the db elements to the unknown.mtf in the poller.cfg, which isn't that easy, but we are starting to understand what could be the cause of the problem and how to solve it. Currently, I am awaiting info from Saeed on other cases which could also be related. 2/25/2002 5:28:29 PM rtrei Have built code, beginning unit testing 2/26/2002 2:41:55 PM rtrei see checkin mail 3/11/2002 2:32:04 PM Betaprogram Melissa, Beta 5 is not yet available for distributed sites. Also, Peter is no longer involved with CCRD beta (even though his name is on the tickets). Please send the messages to Alan and me. We will have a new MIS contact who will take over beta but they are not on board yet. Alan & Terry, when Peter O is on board, this is one of the things he can do - verify that bugs 20272 and 21159 are confirmed fixed. BTW - Beta 5 for distributed is scheduled to be available Wed, March 13th. Donna 2/15/2002 11:42:29 AM Betaprogram UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734)737-7202 While checking the eHealth logs after performing the 5.5 migration, noticed errors in the DataAnalysis log file. Following is the log info: Job started by User at `02/15/2002 04:56:55`. 
----- ----- $NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing 02/15/2002 04:57:01. Warning: Unexpected database error. Warning: Unexpected database error. Error: Unexpected database error. ----- Scheduled Job ended at `02/15/2002 04:57:02`. ----- 2/20/2002 8:55:30 AM shonaryar I have left him three messages in the last 5 days with no response! I gave status about this to Donna Amaral. saeed 2/21/2002 1:07:37 PM shonaryar This happened because the migration did not finish and some tables and all indexes were missing saeed 3/11/2002 2:01:25 PM Betaprogram Email from Beta site on reply from Verifications email: TICKET #: 21164 Appears to be corrected in Beta 4 2/19/2002 9:01:12 AM Betaprogram ALCATEL CANADA: Empowered Networks for Alcatel Canada: Jayde Hanley jayde.hanley@empowerednetworks.com 613-271-7971 nhLoadConfigInfo failed. Following is a record of what happened after the execution of nhLoadConfigInfo.sh: # cd /database/ehealth/migrate # ./nhLoadConfigInfo.sh Starting at Thu Feb 14 13:18:15 EST 2002 Checking Oracle processes ... Loading configuration file into the database ... Creating tables ... Creating indexes ... Setting defaults ... Loading data ... It is now 9:05 AM on Monday Feb 18, and this process has not completed. Database partition has completely filled up: $ df -k Filesystem kbytes used avail capacity Mounted on /dev/dsk/c0t1d0s0 1637233 953172 634945 61% / /proc 0 0 0 0% /proc fd 0 0 0 0% /dev/fd mnttab 0 0 0 0% /etc/mnttab swap 1338840 0 1338840 0% /var/run swap 1339808 968 1338840 1% /tmp /dev/vx/dsk/ebeta/vol01 60017664 60017664 0 100% /database Had to manually kill: ./nhLoadConfigInfo.sh[109]: 19251 Killed 1228.bad is detected, please check related log file. ERROR: some data files were not imported properly. 
Please check .log and .bad files before continuing with migration These log file are located at /database/ehealth/migrate/data/ing_ora_data You must untar configInfo.tar before running nhLoadConfigInfo.sh again # I have attached 1228.bad and 1228.log in a message sent to Donna Amaral. ***Requested files be sent to Betaprogram@concord.com, once done they will be forwarded to the engineer with the assigned ticket and also stored in Outlook public folder: Eng>Beta Test>5.5>Beta Sites (active)>Empowered>Alcatel Canada>Issues*** 2/19/2002 9:34:26 AM Betaprogram received files 2/19/2002 9:35:08 AM Betaprogram Donna, I have forwarded the attachments in a separate reply to Concord Beta. On Monday, after having killed this process, I destroyed and recreated the database and ran nhLoadConfigInfo again. Before I left Alcatel for the day, this process was still running and the /database partition was over 60% full. I expect that it will fill up to 100% as it did the first time, as reported in this ticket. Jayde 2/19/2002 9:35:48 AM Betaprogram One more thing... I will not go onsite to Alcatel until I have a potential resolution to this problem, since I cannot proceed with any more testing until this piece works. I can be onsite today, just notify me when there is a possible solution I can try. Jayde 2/19/2002 3:43:29 PM rtrei Saeed--- Given the information available, this does not look like another version of the 'unknown.mtf' file corruption. The specified mtf files appear in the poller.cfg and look good. I am reassigning this back to you, but I'll be glad to give you what help I can. If we don't have access to this database, I think the top priority is to see what the data looks like in Ingres and in the .dat file. To get a single table from Ingres, all that they need to do is echo "copy table nh_mtf () into 'nh_mtf.dat'\g" | sql $NH_RDBMS_NAME That will create a file called nh_mtf.dat which can be ftp'd to our site. 
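Once a table has been copied out to a .dat file like this, one quick way to hunt for the kind of "funny" character that broke the earlier Ingres-to-Oracle load (ticket 21059) is to scan the file for bytes outside printable ASCII. This is a hypothetical diagnostic in Python, not part of the migration tooling, and the sample row is made up:

```python
def find_suspect_bytes(data: bytes):
    """Return (offset, byte-value) pairs for bytes outside printable ASCII
    plus tab/newline/carriage-return."""
    allowed = set(range(0x20, 0x7f)) | {0x09, 0x0a, 0x0d}
    return [(i, b) for i, b in enumerate(data) if b not in allowed]

# Made-up example row containing a stray BEL (0x07) control character.
sample = b"4801|keg:4801_sp1\x07|100\n"
print(find_suspect_bytes(sample))  # [(17, 7)]
```

Running this over the extracted nh_mtf.dat would pinpoint the byte offset of any corrupt character before the file is handed to the Oracle loader.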
2/20/2002 8:52:42 AM shonaryar I already talked to him and he changed the DB to noarchive mode. I am waiting for feedback saeed 2/21/2002 2:57:10 PM shonaryar The problem with the nh_mtf table is solved and I know where the problem comes from. I modified the data file and control file, and then he could load the table completely. He is continuing with the rest of the migration. saeed 2/22/2002 3:26:40 PM shonaryar I am closing this ticket since he is past the point. saeed 2/19/2002 9:58:07 AM Betaprogram CCRD MIS: Peter Skotny (ProServ) pskotny@concord.com 508-486-4081 Problem starting the Atlanta Poller. Machine eh55-atl (10.0.0.73). A previous problem on this machine had left the poller.cfg file corrupt; see bug # 20394. Rediscovered some elements successfully. Poller stopped and failed on restart. Error message "Internal Error (Configuration Server) Assertion for `elemPtr` failed, exiting (in file ./CfgServer.C, line 2793)." NEED HELP! 2/19/2002 11:08:15 AM dshepard When the cfgServer starts up, it reads the database and the poller.cfg file. Then it synchronizes the data in those two areas, potentially adding or modifying records between them until they match. The data in the poller.cfg takes precedence. Once that is complete it commits any updates and fills a new element cache via the dbServer. Then it iterates through the poller.cfg elements decorating each record with the dbId from the new element cache. That last step is the one that is failing. It suggests that config updates sent via the dbServer failed to make it into the database, or that the new element cache is corrupt. Reassigning to db group. I believe Will may have worked on a similar problem as well. 2/19/2002 5:50:01 PM rtrei I believe this problem is because the nh_om_element view is not refreshed. I need to talk with Sanjay about this before proposing a workaround. 2/21/2002 5:32:26 PM rtrei I manually made a change to one of our materialized views. It looks like that has fixed the problem. 
I did an nhServer start and it has been up for about 2 hours. Can you bring the console up, etc., and check that it seems to have been fixed in your opinion as well? If so, turn off the debugging and turn on the other servers. Meanwhile, I will make the code change to roll the fix I made into beta5, once you verify that it seems to have worked. thanks, Robin 2/25/2002 5:29:55 PM rtrei Ready to check in code, awaiting Ravi's MV checkin 2/26/2002 3:00:10 PM rtrei code checked in 3/11/2002 1:44:36 PM Betaprogram Peter Skotny VERIFIED that this is now fixed 2/20/2002 11:37:54 AM Betaprogram British Telecom Russell Webb +011-44-1473-607852 russell.webb@bt.com Database log file of what happened when a 5.02 CNH (for clapton.omc.bt.co.uk) was migrated to 5.5 can be found in Public Folders/All Public Folders/Engineering/Beta Test/5.5/Beta Sites (Active)/British Telecom/Bugs There are some sqlerrors in here - in particular 'scAppApi::initQuietMode' - maybe these are 'non-worriable' errors! 2/20/2002 2:07:29 PM rhawkes Examples of errors in the logs include the following: ----- drop package sys.plitblm * ERROR at line 1: ORA-04043: object PLITBLM does not exist ----- create public synonym utl_http for sys.utl_http * ERROR at line 1: ORA-00955: name is already used by an existing object ----- drop library SYS.DBMS_PICKLER_LIB * ERROR at line 1: ORA-04043: object DBMS_PICKLER_LIB does not exist ----- create synonym DBA_ROLLBACK_SEGS for SYS.DBA_ROLLBACK_SEGS * ERROR at line 1: ORA-01471: cannot create a synonym with same name as object ----- Following is an informational message from 'WscAppApi::initQuietMode'. Sql Error occured during operation (ORA-01403: no data found ). ----- The last of these is already fixed in Beta 5. 2/22/2002 9:52:21 AM shonaryar These errors are normal and they are generated from Oracle packages which need to be run after DB creation. 
However I am changing the message in nhCreateDb.sh after successful creation to just print "Database Creation Complete." saeed 2/22/2002 4:00:46 PM shonaryar changed message in nhCreateDb.sh saeed 2/21/2002 9:34:21 AM Betaprogram British Telecom Russell Webb 011-44-1473-607852 russell.webb@bt.com The nhSaveConfigInfo.sh script waits a long time before telling you that the poller initialisation won't complete. If poller initialisation takes more than 6 dots then maybe something is wrong. Killed the process and killed the 'standalone poller' process manually. Ignored the sqlerror saying that 'table nh_global_state does not exist or is not owned by you'. 2/21/2002 9:35:31 AM Betaprogram Tried to reboot but this didn't work either. Tried 'nhServer start' first but this says the servers are not running and actually gives an Ingres error 'Unable to make outgoing connection. System communications error - connection refused'. So stop Ingres and restart - seemed to work! Reran nhSaveConfigInfo.sh. This actually says 'Stopping network health servers'. Now it says unable to connect to the server; the license is not usable!! DOH!!!!! Re-tried running the script. This time it said 'nhServers were not running' but still won't initialise the poller. ABANDONED. Migration aborted. Too many problems. So I tried just starting eHealth without doing a migration of the data. There don't seem to be any post-installation tasks. BUT, still get problems (even after reboot). It says: nhServer start Error: Unable to start the server (unable to connect to the db server process). I can in fact connect manually to Oracle, i.e. using sqlplus system/ehealth, so I don't know how CNH does this. 2/21/2002 2:58:34 PM shonaryar He is trying to do a fresh installation now and he is having problems. Ravi and I worked with him, but we still don't know where the problem is coming from. 
saeed 2/22/2002 9:50:58 AM rpattabhi I have looked at this problem and it seems to be because of a misconfigured /etc/system file. Sending customer updated information. He also has a bad license file. -Ravi 2/25/2002 2:34:00 PM rpattabhi Why is this marked --- is it postponed? This customer is still having an issue with the Oracle memory requirement. Marking this B5 -Ravi 2/26/2002 5:22:52 PM rpattabhi Fixed. Customer had < 0.5 GB of memory; changed the memory requirements for Oracle. 2/21/2002 10:37:43 AM Betaprogram Jeff Beck - Concord NPI onsite at: CompuCom Mike Everhart 972-856-4428 meverhar@compucom.com Issue 1: Perhaps a non-issue. During the nhsaveconfiginfo script, I get the error: nh_global_state does not exist or is not owned by you. It finishes up and says it was successful at the end though, so I have been ignoring this one. 2/21/2002 12:24:45 PM shonaryar This is acceptable and we know about it. saeed 2/21/2002 10:39:26 AM Betaprogram Jeff Beck - Concord NPI onsite at: CompuCom Mike Everhart 972-856-4428 meverhar@compucom.com This one is a show stopper right now. When running nhcreatedb I get the error: Service for this SID already created. Enter different SID name. OSerror os1073 unable to create oracle service. I was using the default SID of EHEALTH. Just for kicks I changed the SID to EHEALTH1 and still get the same error. 2/21/2002 12:23:49 PM shonaryar He did not have his path set correctly. saeed 2/21/2002 4:13:40 PM Betaprogram UNISYS: Thursday, February 21, 2002 1:06:18 PM shonaryar UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734)737-7202 Maximum number of cursors reached during migration. 2/21/2002 4:55:03 PM shonaryar This bug is already fixed in beta5. I called David and left him a message about the workaround, but he didn't call me back yet. 
saeed 2/25/2002 1:45:48 PM shonaryar This problem is fixed saeed 3/11/2002 2:02:43 PM Betaprogram Verification Email from Site: TICKET #: 21260 Fixed in Beta 5; applied fixes to Beta 4 and they appear to have corrected the problem. 2/21/2002 4:14:02 PM Betaprogram UNISYS: Thursday, February 21, 2002 1:06:18 PM shonaryar UNISYS: David Lerchenfeld david.lerchenfeld@unisys.com (734)737-7202 Maximum number of cursors reached during migration. This is a repeat of #21252. 2/22/2002 9:57:13 AM Betaprogram Repeat of 21260 3/11/2002 2:03:16 PM Betaprogram Email from Site: TICKET #: 21261 Fixed in Beta 5; applied fixes to Beta 4 and they appear to have corrected the problem. 2/21/2002 5:05:41 PM Betaprogram EQUANT: Contact: Concord AE: Nital Gandhi -----Original Message----- > From: Gandhi, Nital > Sent: Thursday, February 21, 2002 12:11 PM > To: Tang, TC > Subject: RE: remedy 21194 (equant join timeout) > > yes, and I think that worked. > But we have another problem with this box...for some reason > oracle is not running - should I just have them do dbstart from > ORACLE_HOME/bin as root? I searched the kb but there is not > much there. > thanks, > nital > 2/22/2002 9:40:18 AM rpattabhi Somebody changed this bug to Saeed. I have sent mail to Nital about this. -Ravi 2/25/2002 11:56:22 AM rpattabhi Waiting for customer info. 2/26/2002 4:48:26 PM Betaprogram I'm closing this ticket. Customer has not proceeded with this testing. If the problem persists we will re-open the ticket. -Donna A 2/22/2002 2:43:16 PM ascorupsky Prof. 
Services developed custom variable files for IBM Global Services that worked for 4.8 on Solaris. During the upgrade to 5.02, the nhConvertDb command gives: Warning: Some rows do not reference valid entries in nh_element_type The invalid rows are being ignored Please correct the file elementTypeVariable.usr I can send a tar file of the deliverables on demand; here are the two very simple custom variable files that are used: elementTypeVariable.usr 200|1000001|1|33 201|1000001|1|33 and variable.usr 1000001|2|SNA TRAFFIC|SNA TRAFFIC|snaTraffic I repeat - these customizations worked successfully for eHealth 4.8 2/26/2002 1:23:55 PM rdiebboll This was initially attributed to Product Component "eHDP Custom Variable", but it is not related. Assigning to Database. 2/26/2002 2:40:23 PM yzhang Alexander, you just entered this new ticket. Is this a customer problem or an in-house problem? If it is in house, I want to see it. If it is from a customer, then we need the call ticket number. Thanks Yulun 2/27/2002 6:58:39 PM yzhang The data in the elementTypeVariable.usr file the customer is trying to load into the nh_elem_type_var table violates referential integrity, because there is no corresponding primary key in the nh_element_type table. The next step is to find which file is loaded into the nh_element_type table during the db conversion 2/28/2002 4:59:19 PM yzhang With Will and Robin's help, we now have a workaround. Alexander will send this new file (located in the /tmp directory of system sulfur) to the customer so that they can keep going. I will continue working on the problem of why nhConvertDb does not convert the element type. Will and Robin, the new elementTypeVariable.usr created by Will worked in the test. Thanks for the help Yulun 3/4/2002 11:20:36 AM yzhang We have to work on the permanent fix for this one. The first thing I want to know is whether the workaround works. Also please get the upgrade log and/or db conversion log from the customer, and any other information you think useful. 
Thanks Yulun 3/5/2002 1:23:48 PM rrick Message from Jeremy: - Workaround worked fine. 3/5/2002 1:34:49 PM yzhang Please get the upgrade log and/or db conversion log from the customer, and any other information you think useful. 3/6/2002 10:26:25 AM will It looks like nhiCnvtElemTypes is not working correctly if there are no elementType.usr, elementTypeVar.usr, or elementTypeEnum.usr files. If the customer has only created one of these, then the convert may fail. If the user has not created an elementType.usr file (but only added custom variables to existing types), then the conversion is skipped altogether. Adding checks for the presence of each .usr file and not aborting the conversion if one is missing. 3/7/2002 5:01:11 PM will I've written the code, but Yulun is going to test it for me, since he has an appropriate build environment and test case. 3/8/2002 6:31:55 PM will Testing done by Yulun indicates that the new nhiCnvtElemTypes is now working correctly when files are missing. He did encounter an error in nhConvertDb that I believe is related to having an inconsistent environment. He is going to retest the DB load with the fixed nhiCnvtElemTypes to see if the system works when starting in a consistent state. 3/13/2002 11:14:20 AM will Final load test worked. A one-off is now available. 3/21/2002 1:05:57 PM dwaterson Associating ticket 61853 - customer upgraded from 4.8 to 5.0.2 P1, D1 Solaris 2.7 nhConvertDb nethealth ***WARNING: Some rows do not reference valid entries in the nh_element_type. 
The invalid rows are being ignored. Please correct the file: /opt/concord/db/data/elementTypeVariable.usr. This is the elementTypeVariable.usr file:
104052|1000001|1|1000001
104052|1000002|1|27
104052|1000003|1|28
104052|1000004|1|29
104052|1000005|1|30
4|1000001|1|1000001
4|1000002|1|27
4|1000003|1|28
4|1000004|1|29
4|1000005|1|30
104114|1000001|1|1000001
104114|1000002|1|27
104114|1000003|1|28
104114|1000004|1|29
104114|1000005|1|30

3/22/2002 11:15:11 AM will The new symptoms reported by dwaterson are not covered by the currently existing one-off. There needs to be additional code to convert the full-duplex element types and to remove duplicate entries from the file.

3/26/2002 4:02:54 PM will Fixed handling of missing and merged element types. New one-off available.

3/27/2002 12:16:43 PM foconnor ============================================ Call ticket 61853: Customer has removed these rows from the elementTypeVariable.usr file:
4|1000001|1|1000001
4|1000002|1|27
4|1000003|1|28
4|1000004|1|29
4|1000005|1|30
and nhConvertDb runs fine. He is not interested in testing the 21281 one-off. Closed call ticket 61853. =============================================

3/29/2002 11:29:55 AM will Fixed in 5.0.2 P03. Test plan: In 4.8, add custom variables to existing types 0, 2, 4, 100, 502, 504, and 600. Add one variable common to all of the types, but add a unique one to each type as well. This initial configuration should be something like 0: A, B; 2: A, C; 4: A, D; 100: A, E; 502: A, F; 504: A, G; 600: A, H. Make sure that NO custom element types are being added. Set up Live Exceptions on these element types and variables. Save the database and load it onto a 5.0 system with the patch applied. Validate that after loading the database, the variables for 2, 4, 502, and 504 have been consolidated together, the variables for 100 and 600 have been consolidated together, and that there are no duplicates in the elementTypeVariable.usr file.
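The referential-integrity check that produces the warning above can be approximated offline before running nhConvertDb. A minimal sketch, assuming the leading pipe-delimited field of each .usr row is the element-type id (as in the sample rows in this ticket); the helper name is made up for this sketch:

```shell
# Flag rows in elementTypeVariable.usr whose leading element-type id has no
# matching entry in elementType.usr. Field layout is inferred from the sample
# rows in the ticket above; check_usr_refs is a hypothetical helper name.
check_usr_refs() {
  types="$1"   # elementType.usr (type id assumed to be field 1)
  vars="$2"    # elementTypeVariable.usr (type id assumed to be field 1)
  awk -F'|' 'NR==FNR { seen[$1] = 1; next }
             !($1 in seen) { print "orphan row: " $0 }' "$types" "$vars"
}
```

Any "orphan row" output corresponds to a row nhConvertDb would ignore; this sketch does not cover the merged full-duplex types that the one-off handles.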
Make sure that there are no other errors when converting the types and that, after the conversion, the new variables correctly show up in the report UIs. After the conversion, the configuration should be something like 104053: A, B; 104052: A, C, D, F, G; 104114: A, E, H. Validate that Live Exceptions still works correctly against elements of these types and variables.

6/28/2002 4:23:01 PM aanjorin Passed on 5.0.2 P3.

2/26/2002 10:30:20 AM Betaprogram Alcatel Canada: Contact: Jayde Hanley, office: (613) 271-7971, cell (613) 290-5404 -----Original Message----- From: Jayde Hanley [mailto:jayde@empowerednetworks.com] Sent: Tuesday, February 26, 2002 9:30 AM To: shonaryar@concord.com Cc: damaral@concord.com Subject: nhLoadPolledData error

Saeed, I ran nhLoadPolledData.sh yesterday. It took all day to run but it finally completed. There was an error message when it finished though:
......Loaded table nh_hourly_volume_1000014 .Loaded table nh_hourly_volume_1000015 .Loaded table nh_dlg0_1013619599 .Loaded table nh_dlg0_1013633999 Loaded table nh_rlp_boundary Load completed successfully Removing duplicate rows ... Generating primary keys... Reseting primary key on NH_RLP_BOUNDARY table ... Generating indexes ... ERROR: index.ddl failed to run successfuly. Please refer to /database/ehealth/migrate/log/nhLoadPolledDataSql.log for details. Process will continue, correct the problem and execute /database/ehealth/migrate/schema/ing_ora_schema/index.ddl again. Removing temporary migration tables from database ... Ending at Mon Feb 25 16:38:38 EST 2002
I have attached nhLoadPolledDataSql.log and nhLoadPolledData.log. Please advise me as soon as possible whether this is a bug, and what my next steps should be.
Jayde ----------------------------- Log files in this memo are in the Public Folders: Engineering\Beta_Test\5.5\Beta Sites - Active\Empowered\Alcatel Canada\Issues

2/26/2002 1:04:31 PM mfintonis Updated the Software Revision field from B3 to found in Beta 4.

2/26/2002 1:59:13 PM shonaryar This error is fixed in Beta 5. saeed

2/26/2002 2:12:31 PM mfintonis Changed status to fixed per last entry.

3/19/2002 4:20:30 PM Betaprogram Customer verified fix: "No errors occurred when running nhLoadPolledData.sh."

2/26/2002 4:58:44 PM jpoblete Customer: British Telecom Exact. nhiSaveDb fails while trying to unload an nh_stats1 table: Fatal Internal Error: Unable to execute 'COPY TABLE nh_stats1_1009843199 () INTO 'test.tdb/nh_stats1_1009843199'' (E_US0845 Table 'nh_stats1_1009843199' does not exist or is not owned by you. (Tue Feb 19 13:38:10 2002)). (cdb/DuTable::saveTable)

Support has found the following problems:
- The table nh_stats1_1009843199 is not listed in the help\g output from the DB.
- There is no reference to this table in the rollup boundary table for rlp_stage = 1, only for rlp_stage = 2: |ST | 2| 1009670400| 1010275199| 1009670446| 1009843199| | 0|
- No errors were found in errlog.log at the time nhiSaveDb failed.
- There is no file related to this table in iifile_info: INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version OI 2.0/9712 (su4.us5/00) login Tue Feb 26 12:53:42 2002 continue * Executing . . .
+--------------------------------+--------------------------------+------------+-------+--------------------------------+-------------+-------------+ |table_name |owner_name |file_nam|file_e|location |base_id |index_id | +--------------------------------+--------------------------------+------------+-------+--------------------------------+-------------+-------------+ +--------------------------------+--------------------------------+------------+-------+--------------------------------+-------------+-------------+ (0 rows) continue * Your SQL statement(s) have been committed. Ingres Version OI 2.0/9712 (su4.us5/00) logout Tue Feb 26 12:53:42 2002 - The following error appears upon ingres startup ... nhStartDb invoked on Monday February 18 12:41:23 GMT 2002 Starting OpenIngres servers on Monday February 18 12:41:23 GMT 2002 ...started successfully. Sysmoding database 'nethealth' . . . Modifying 'iiattribute' . . . E_US1208 Duplicate records were found. (Mon Feb 18 07:41:57 2002) Sysmod of database 'nethealth' abnormally terminated. Exitting nhStartDB with status 17 At this point, we need engineering's help to save the customer's DB and then perform a Save/Destroy/Create/Reload Complete advanced logging of nhiSaveDb is available on the call ticket directory. \\BAFS\Escalated Tickets\59000\59628\Feb19 PS. 
In attempts to save the DB before it failed on this table, nhiDbSave failed on the following stage: 02/14/02 20:22:26 [z,cu ] returning env var = 'NH_BIN_SYS_DIR' for type = 4 02/14/02 20:22:26 [z,cu ] returning env var val = '' for type = 4 02/14/02 20:22:26 [z,cu ] returning env var = 'NH_BIN_DIR' for type = 2 02/14/02 20:22:26 [z,cu ] returning env var val = '' for type = 2 02/14/02 20:22:26 [z,cu ] returning env var = 'NH_HOME' for type = 1 02/14/02 20:22:26 [z,cu ] returning env var val = '/opt/neth' for type = 1 02/14/02 20:22:39 [z,cu ] returning env var = 'NH_DBLOC_STS_RAW' for type = 1135 02/14/02 20:22:39 [z,cu ] returning env var val = '' for type = 1135 02/14/02 20:22:39 [z,du ] Saving table nh_stats0_1013054399 to file new_dbSave.tdb/nh_stats0_1013054399 ... 02/14/02 20:22:39 [d,du ] Begin transaction level 1 02/14/02 20:22:39 [d,du ] Executing SQL cmd 'COPY TABLE nh_stats0_1013054399 () INTO 'new_dbSave.tdb/nh_stats0_1013054399'' ... 02/14/02 20:22:39 [d,du ] DuDatabase (execSql): errorOnNoRows: No 02/14/02 20:22:39 [Z,du ] (dbExecSql): errorOnNoRows: No 02/14/02 20:22:39 [Z,du ] (dbExecSql): sqlCmd: COPY TABLE nh_stats0_1013054399 () INTO 'new_dbSave.tdb/nh_stats0_1013054399' 02/14/02 20:22:56 [Z,du ] (dbExecSql): sqlca.sqlcode: 0 02/14/02 20:22:56 [Z,du ] (dbExecSql): rows: 164050 02/14/02 20:22:56 [Z,du ] returning DuScNormal 02/14/02 20:22:56 [d,du ] < Cmd complete, SQL code = 0 02/14/02 20:22:56 [d,du ] Saved table successfully. 02/14/02 20:22:56 [d,du ] Committing database transaction ... 02/14/02 20:22:56 [d,du ] Committed. 
02/14/02 20:22:56 [d,du ] End transaction level 1 02/14/02 20:22:56 [z,cu ] returning env var = 'NH_BIN_SYS_DIR' for type = 4 02/14/02 20:22:56 [z,cu ] returning env var val = '' for type = 4 02/14/02 20:22:56 [z,cu ] returning env var = 'NH_BIN_DIR' for type = 2 02/14/02 20:22:56 [z,cu ] returning env var val = '' for type = 2 02/14/02 20:22:56 [z,cu ] returning env var = 'NH_HOME' for type = 1 02/14/02 20:22:56 [z,cu ] returning env var val = '/opt/neth' for type = 1 02/14/02 20:23:09 [s,cba ] Signal handler invoked for signal = 1, writing to pipe 02/14/02 20:23:09 [d,cba ] Exit requested with status = 272698112 02/14/02 20:23:09 [d,cba ] Exiting ... See: \\BAFS\Escalated Tickets\59000\59628\Feb15\nhiSaveDb_dbg.txt

2/27/2002 4:26:11 PM yzhang Requested they do a DB save after running DbRollup. If the save fails at the same place, then we need to create that stats1 table, place a corresponding entry in the rlp_boundary table, and touch the physical file if there is none; then redo the save.

3/4/2002 1:20:49 PM jpoblete Yulun, the problem shifted: now we do not get the same error. See the save.log: Begin processing (04/03/2002 12:26:32). (dbu/DbuSaveDbApp::run) Copying relevant files (04/03/2002 12:26:33). (dbu/DbuSaveDbApp::run) Unloading the data into the files, in directory: 'test0401.tdb/'. . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_daily_exceptions . . . Unloading table nh_daily_health . . . Unloading table nh_daily_symbol . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . .
Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . . Unloading table nh_hourly_health . . . Unloading table nh_hourly_volume . . . Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . . Unloading table nh_system_log . . . Unloading table nh_step . . . Unloading table nh_schema_version . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_protocol . . . Unloading table nh_protocol_type . . . Unloading table nh_rpt_config . . . Unloading table nh_rlp_plan . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_analysis . . . Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . Unloading table nh_var_units . . . Unloading the sample data . . . Unloading the latest sample data definition info . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_bsln_info . . . Unload of database 'nethealth' for user 'neth' completed successfully. Error: File not found. (dbu/saAppUtils::cpFiles) Error: File not found. (dbu/saAppUtils::cpFiles) Error: File not found. (dbu/saAppUtils::cpFiles) Error: The program nhiSaveDb failed. 
(dbu/DbuSaveDbApp::run) The whole advanced logging for March 04 is in the call ticket directory. How can we tell which three files nhiSaveDb is failing on?

3/4/2002 1:31:33 PM jpoblete The advanced logging shows this: 03/04/02 15:36:00 [d,sa ] filename: Cellnet.grp 03/04/02 15:36:00 [i,cu ] Opening file = '/opt/neth/reports/lanWan/Cellnet.grp', mode = 0x5, prot = 0666 03/04/02 15:36:00 [z,cu ] Returning protection: 0664 for file: '/opt/neth/reports/lanWan/Cellnet.grp' 03/04/02 15:36:00 [i,cu ] Opened file: '/opt/neth/reports/lanWan/Cellnet.grp' 03/04/02 15:36:00 [i,cu ] Opening file = 'test0401.tdb/lanWan/Cellnet.grp', mode = 0x6, prot = 0666 03/04/02 15:36:00 [i,cu ] Opened file: 'test0401.tdb/lanWan/Cellnet.grp' 03/04/02 15:36:00 [i,cu ] Closing file = '/opt/neth/reports/lanWan/Cellnet.grp' 03/04/02 15:36:00 [i,cu ] Close complete, status = Yes 03/04/02 15:36:00 [i,cu ] Closing file = 'test0401.tdb/lanWan/Cellnet.grp' 03/04/02 15:36:00 [i,cu ] Close complete, status = Yes 03/04/02 15:36:00 [d,sa ] filename: Centre_For_EconomicsandBusiness_Research-BtNet_Start_64_Kbps-64K.gr 03/04/02 15:36:00 [m,cu ] Deleting CuBuffer[348976] 03/04/02 15:36:00 [m,cu ] Deleting data[0] = [0xf0e770] 03/04/02 15:36:00 [m,cu ] Deleting data[1] = [0xf09d28] 03/04/02 15:36:00 [m,cu ] Deleting data[2] = [0xf0c2d8] 03/04/02 15:36:00 [m,cu ] Deleting data[3] = [0xf0c9b0] 03/04/02 15:36:00 [m,cu ] Deleting data[4] = [0xf06cb0] 03/04/02 15:36:00 [m,cu ] Deleting data[5] = [0xf11b40] 03/04/02 15:36:00 [m,cu ] Deleting data[6] = [0xf13880] 03/04/02 15:36:00 [m,cu ] Deleting data[7] = [0xf15c90] 03/04/02 15:36:00 [m,cu ] Deleting data[8] = [0xf179d0] 03/04/02 15:36:00 [m,cu ] Deleting data[9] = [0xf1a150] 03/04/02 15:36:00 [m,cu ] Deleting data[10] = [0xf1be90] 03/04/02 15:36:00 [m,cu ] Deleting data[11] = [0xf1dbd0] 03/04/02 15:36:00 [m,cu ] Deleting data[12] = [0xf208a0] 03/04/02 15:36:00 [m,cu ] Deleting data[13] = [0xf23708] 03/04/02 15:36:00 [m,cu ] Deleting data[14] =
[0xf39540] 03/04/02 15:36:00 [m,cu ] Deleting data[15] = [0xf4d258] 03/04/02 15:36:00 [m,cu ] Deleting data[16] = [0xf5dea8] 03/04/02 15:36:00 [d,sa ] DbuSaveDbApp::run is complete. 03/04/02 15:36:00 [d,cba ] Exit requested with status = 1 03/04/02 15:36:00 [d,cba ] Exiting ...

It should not fail on this group...
-rw-rw-r-- neth /opt/neth/reports/lanWan/Cellnet.grp
-rw-rw-r-- neth /opt/neth/reports/lanWan/Centre_For_EconomicsandBusiness_Research-BtNet_Start_64_Kbps-64K.grp

3/8/2002 3:06:57 PM yzhang Ticket 18695 (the one Robin used to work on with Sheldon) mentioned that there is a workaround for this problem. Can you check with Sheldon for the workaround first? If there is no workaround, run the following: sh -x NHINSTALL.NH >& instal_debug.out (use the INSTALL.NH from the CD; don't put any debug information inside). Thanks, Yulun

3/8/2002 5:28:18 PM yzhang Check the group files under reports, remove any group file which does not show on the console, then do the DB save.

3/25/2002 7:47:53 AM jpoblete Yulun, we have done this; the number of errors has decreased from 3 to 1, but it still fails: Unload of database 'nethealth' for user 'neth' completed successfully. Error: File not found. (dbu/saAppUtils::cpFiles) End processing (21/03/2002 14:12:41). (dbu/DbuSaveDbApp::run) About 5830 files should be saved, but only 4497 are being saved. I sent you by e-mail the advanced logging for the process, a list of the files in the tdb directory, and the list of all the files which should be saved. Thank you. JMP

3/25/2002 3:46:08 PM yzhang Jose, thanks for your investigation.
I made some instrumentation to figure out which file or files cannot be found during the save. The instrumented nhiSaveDb is located (on sulfur) in /home/eng/yzhang/remedy/21350. They need to back up the original, then copy the new one into $NH_HOME/bin/sys, then run a command like the following, and finally send me save_deb.out: ./nhiSaveDb -Dall -p /export/sulfur3/nh48_s_m/db/save/save_ins_again.tdb -d nethealth > & /tmp/save_deb.out

4/2/2002 5:29:18 PM yzhang How is the customer doing on this?

4/17/2002 2:01:18 PM yzhang Mike, this is my last update. I called Jose and he said this one went to you. Thanks for your investigation. I made some instrumentation to figure out which file or files cannot be found during the save. The instrumented nhiSaveDb is located (on sulfur) in /home/eng/yzhang/remedy/21350. They need to back up the original, then copy the new one into $NH_HOME/bin/sys, then run a command like the following, and finally send me save_deb.out.

4/19/2002 11:50:34 AM tbailey The customer has now decided not to load the fix. They say the problem was resolved when they renamed all such group names to fewer than 66 characters, and that this resolved the error during the database save.

2/27/2002 9:33:36 AM Betaprogram ALCATEL CANADA: Empowered Networks for Alcatel Canada. Jayde Hanley, jayde.hanley@empowerednetworks.com, 613-271-7971. I could not successfully save a database due to the following error: $ nhSaveDb -p /database/beta/ehealth55.tdb See log file /database/ehealth55/log/save.log for details... Begin processing 02/26/2002 01:53:48 PM. Copying relevant files (02/26/2002 01:53:49 PM). Error: Database error: ERROR: SQLCODE=-258 SQLTEXT=ORA-00258: manual archiving in NOARCHIVELOG mode must identify log . Error: The program nhiSaveDb failed. Refer to log /database/beta/ehealth55.tdb/oracle_rman/rman_save.log for more details.. $ The log file mentioned does not exist.

2/28/2002 9:59:46 AM rpattabhi MoreInfo Hi, this is Ravi from Concord.
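Ticket 21350's eventual resolution above (renaming group files to fewer than 66 characters) suggests a quick audit before running a save. A sketch; the 66-character threshold is empirical from this ticket rather than a documented limit, and the directory argument stands in for $NH_HOME/reports:

```shell
# List .grp files under the given reports directory whose base name is longer
# than 66 characters -- the length at which nhiSaveDb in ticket 21350 failed
# to copy group files. The threshold is taken from the ticket, not a spec.
long_group_names() {
  find "$1" -name '*.grp' | awk -F/ 'length($NF) > 66'
}
```

Any file it prints is a candidate for renaming before the save is retried.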
Can you try the following work around? source nethealthrc.csh / nethealthrc.sh svrmgrl nhServer stop nhStopDb svrmgrl connect internal; startup mount; alter database archivelog; alter database open; exit; This should put your db back in archivelog mode. Now try the binary save again. It could have been left in noarchivelog mode by an ascii save that was terminated prematurely. Can you send me the log from the above commands? thanks -Ravi 2/28/2002 11:25:35 AM rpattabhi In talking with Saeed we seem to have told the customer to turn off archiving to speed up migration. May be the customer never turned it back on. I have already given the customer the workaround for this. -Ravi 2/28/2002 11:33:17 AM rpattabhi More info from customer and feedback from me, Jack, The nhServer stop command did a shutdown abort. (This has been changed in B5). You still haven't flipped the archiving mode because of error below. pls do the following instead. nhServer stop svrmgrl connect internal; shutdown immediate; alter database mount; alter database archivelog; alter database open; exit; nhServer start nhLoadDb -p ... Pls send me the output from the above as before, Thanks. -Ravi Ravi > -----Original Message----- Ravi > From: Jayde Hanley [mailto:jayde@empowerednetworks.com] Ravi > Sent: Thursday, February 28, 2002 10:15 AM Ravi > To: Pattabhi@empowerednetworks.com, Ravi; Ravi > 'jayde.hanley@empowerednetworks.com' Ravi > Subject: Ticket 21358 Ravi > Ravi > Ravi > Hi Ravi, Ravi > Ravi > The database was left in noarchivelog mode after Saeed Ravi > had assisted me in Ravi > getting nhLoadConfigInfo.sh to work. Ravi > Ravi > Here are the results of the commands. Ravi > Ravi > As ehealth: Ravi > $ nhServer stop Ravi > Stopping eHealth servers. Ravi > $ Ravi > Ravi > As oracle: Ravi > ebeta% nhStopDb Ravi > Oracle will now be shutdown with the 'abort' option. Ravi > Ravi > Oracle Server Manager Release 3.1.7.0.0 - Production Ravi > Ravi > Copyright (c) 1997, 1999, Oracle Corporation. 
All Ravi > Rights Reserved. Ravi > Ravi > Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production Ravi > With the Partitioning option Ravi > JServer Release 8.1.7.0.0 - Production Ravi > Ravi > SVRMGR> Connected. Ravi > SVRMGR> ORACLE instance shut down. Ravi > SVRMGR> Ravi > Server Manager complete. Ravi > Database "EHEALTH" shut down. Ravi > ebeta% svrmgrl Ravi > Ravi > Oracle Server Manager Release 3.1.7.0.0 - Production Ravi > Ravi > Copyright (c) 1997, 1999, Oracle Corporation. All Ravi > Rights Reserved. Ravi > Ravi > Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production Ravi > With the Partitioning option Ravi > JServer Release 8.1.7.0.0 - Production Ravi > Ravi > SVRMGR> connect internal; Ravi > Connected. Ravi > SVRMGR> startup mount; Ravi > ORACLE instance started. Ravi > Total System Global Area 875737248 bytes Ravi > Fixed Size 73888 bytes Ravi > Variable Size 141119488 bytes Ravi > Database Buffers 734003200 bytes Ravi > Redo Buffers 540672 bytes Ravi > Database mounted. Ravi > SVRMGR> alter database archivelog; Ravi > alter database archivelog Ravi > * Ravi > ORA-00265: instance recovery required, cannot set ARCHIVELOG mode Ravi > SVRMGR> alter database open; Ravi > Statement processed. Ravi > SVRMGR> exit; Ravi > Server Manager complete. Ravi > ebeta% Ravi > Ravi > As ehealth: Ravi > $ nhSaveDb -p /database/beta/ehealth55.tdb Ravi > See log file /database/ehealth55/log/save.log for details... Ravi > Begin processing 02/28/2002 10:06:58 AM. Ravi > Copying relevant files (02/28/2002 10:07:03 AM). Ravi > Error: Database error: ERROR: SQLCODE=-258 Ravi > SQLTEXT=ORA-00258: manual archiving Ravi > in NOARCHIVELOG mode must identify log Ravi > . Ravi > Error: The program nhiSaveDb failed. Ravi > Refer to log Ravi > /database/beta/ehealth55.tdb/oracle_rman/rman_save.log for more Ravi > details.. Ravi > $ Ravi > Ravi > 2/28/2002 4:19:50 PM Betaprogram Ravi, The database save now works. Here is the output. 
ebeta% svrmgrl Oracle Server Manager Release 3.1.7.0.0 - Production Copyright (c) 1997, 1999, Oracle Corporation. All Rights Reserved. Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production With the Partitioning option JServer Release 8.1.7.0.0 - Production SVRMGR> connect internal; Connected. SVRMGR> shutdown immediate; Database closed. Database dismounted. ORACLE instance shut down. SVRMGR> startup mount; ORACLE instance started. Total System Global Area 875737248 bytes Fixed Size 73888 bytes Variable Size 141119488 bytes Database Buffers 734003200 bytes Redo Buffers 540672 bytes Database mounted. SVRMGR> alter database archivelog; Statement processed. SVRMGR> alter database open; Statement processed. SVRMGR> exit; Server Manager complete. ebeta% # nhSaveDb -p /database/beta/ehealth55.tdb See log file /database/ehealth55/log/save.log for details... Begin processing 02/28/2002 02:22:56 PM. Copying relevant files (02/28/2002 02:23:00 PM). Backup completed End processing 02/28/2002 03:01:52 PM. # Here is size info: # du -sk ehealth55.tdb 3543216 ehealth55.tdb

3/19/2002 4:21:13 PM Betaprogram Jayde verified that this is indeed fixed in Beta 5.

2/28/2002 9:29:16 AM Betaprogram A 6k-element DB running on an inadequate (386-class) Win2k system was saved (Beta 4 version), to be loaded onto a new machine also running Win2k where eHealth Beta 4 is installed. The first load attempt got a segmentation violation: D:\eHealth\bin\nhloaddb -p d:\ehealthdb\.tdb The database you chose to load is from Machine name: nethealth Machine type: server Do you want to load the database?{n} y See log file D:/eHealth/log/load.log for details... Begin Processing 2/27/2002 03:11:52 _ ...... segmentation fault [1] + Done<139>? 788 Segmentation violation D:/eHealth/bin/sys/nhiLoadDb Looks like the path slashes are wrong ...
He tried again (moved the .tdb to a different drive) and got a pop-up message: nhiLoadDb.EXE: Internal Error: Unable to connect to database 'EHEALTH' (ORA-12560 TNS protocol adapter error). (du/DuDatabase::dbConnect) I will email his emails to Rich Hawkes; they are jpg images. - Joe Banks, direct number is 703 428 1548

2/28/2002 10:28:15 AM rpattabhi Called customer. He needs to do a start db and try reloading. -Ravi

2/28/2002 10:28:43 AM rpattabhi More info

2/28/2002 2:46:14 PM rpattabhi Talked to the customer. He did a nightly save without specifying a path, and the save put the file in $NH_HOME\.tdb. If this is done, the file is no longer loadable by nhloaddb, which tags on the .tdb extension. Asked the customer to do a binary save with an actual file name. -Ravi

2/28/2002 5:54:28 PM rpattabhi Customer called saying that even after the binary save the load had a problem. Will have to talk to him tomorrow.

3/1/2002 12:48:50 PM rpattabhi Here is some more updated information sent to the customer. He will be retrying the saves without using the checkpoint option, as this is not supported on Oracle. So all the info we have does not suggest a bug. The one issue where he hit a segv could not be reproduced. This happened probably because the save directory was .tdb instead of save.tdb, and when we flip the \\ to // we may have had a bug. However, I don't see this problem on my NT machine. We should close this issue as NOBUG unless the customer tries the steps below and reports a different problem. He has Rich Hawkes' number and will contact him. Assigning to Rich. Joe: please do the following. To save: nhSaveDb -p c:/ehealthdb/binary_save_3_1_2001 To load: nhLoadDb -p c:/ehealthdb/binary_save_3_1_2001 If you have tried the above and it failed, send me the following files: 1. save.log and load.log in the directory %NH_HOME%\log 2.
send me the following files from the \oracle_rman directory: a) *.log b) *.scr c) *.cmt d) *.cfg Note: if you tried the save from the UI / scheduled job and it failed because of using the checkpoint option, then just run the save and load manually as I described above. Can you please send me this info ASAP, as I will be on vacation starting tomorrow and I would like to get this resolved ASAP. thanks -Ravi

3/1/2002 5:06:26 PM rhawkes Customer testing.

3/4/2002 12:51:22 PM rhawkes Phoned customer. He claims he is still seeing the problem, and will e-mail us answers to Ravi's questions. Moving back into MoreInfo state.

3/4/2002 3:20:21 PM rhawkes We received additional files for this investigation -- assigning to Saeed.

3/5/2002 8:49:47 AM rhawkes Anil has a ticket with a similar symptom, so he agreed to at least do a preliminary investigation of this one too. The customer in this case has sent us the files that Ravi had requested.

3/5/2002 4:32:14 PM rtrei 21380 -- (Joe Banks) nhLoadDb failure. I believe this will turn out to be a NOBUG. When I looked into it with Joe, it turned out that his environment variables were not set correctly for Oracle, so it was failing to connect to the database. I had him reset the variables, reboot, recheck that they were set correctly, then do an nhSaveDb. At that point, he was supposed to call me. He hasn't called; I left him additional voicemail. Donna A: can you have someone check in the AM? However, at this point I am thinking no news is good news.

3/5/2002 4:38:54 PM Betaprogram At 4:30 I left Joe vmail indicating the urgency of his feedback and asked him to make it a priority to call Robin or me regarding status asap. -Donna

3/5/2002 4:41:30 PM Betaprogram Hi Joe, after resetting the variables, rebooting, and rechecking that they were set correctly, you were then to do an nhSaveDb. Can you let us know how this worked out and what the status of this bug is? Thanks!!
-Melissa

3/6/2002 11:22:27 AM rtrei Problem was that the NT box did not have the Oracle environment variables set. This caused the connection to fail and the load to fail.

2/28/2002 3:14:14 PM cestep Customer originally had a problem with the server crashing during data analysis. They would receive the following error in $II_SYSTEM/ingres/files/errlog.log during the data analysis job: 00000138 Wed Jan 30 05:00:21 2002 E_DM9351_BTREE_BAD_TID Btree Error on table (nh_hourly_health, concord) of database 'nethealth' : Leaf page references non existent tuple : Bid (1534, 0), Tid (1557, 0), Key is '.A6 nh_hourly_health.out 8. Type: echo "help table nh_rpt_config\g" | sql nethealth > rpt.out 9. Please send these files. 10. Restart the nethealth server. Also, would the customer be willing to upgrade to eHealth 4.8? Please contact me with any questions. Thank you,

3/6/2002 7:43:45 AM cestep Received information from the customer. All files are on BAFS under 58609/3.6.02.

3/6/2002 7:46:38 AM cestep From the nhDbStatus output: The NH_HOME environment variable was set to a directory that is incorrect or missing a needed file: D:/nethealth. 'nhDbStatus' should be in a bin subdirectory beneath NH_HOME, and NH_HOME should have a file named 'nethealthrc.sh'. Please set NH_HOME to the correct location, or unset it.

3/6/2002 10:18:02 AM yzhang I looked at your data files; several scheduled jobs failed recently. The other possibility is that the database has been corrupted. Colin, can you work with them on the following: we have to make sure their stats rollup, maintenance, conversation rollup (even though it is not their focus), and dbstats work properly. Work with them on using the correct commands. After you recycle nhServer and Ingres, get the sysmod output for me first.
1) Stop nhServer from the service window. 2) Stop Ingres from the service window. 3) Start Ingres from the service window. 4) sysmod nethealth > sysmod.out 5) Start nhServer from the service window. 6) nhReset.sh > nhReset.out 7) nhDbStats > nhDbstats.out 8) nhiRollupDb > stats_rollup.out 9) nhiDailogRollup > conversation.out 10) nhiDataAnalysis > DA.out

3/7/2002 9:53:52 AM cestep All requested info received. Changing to assigned.

3/7/2002 10:24:28 AM yzhang Colin, since their DB is small and they only have stats information, the quick option is for them to recycle the database. Find the latest DB save first, and please write up the procedure for recycling the database. Thanks, Yulun

3/13/2002 11:12:28 AM cestep The reseller has obtained the database and loaded it on a 4.8 system they have there. They were successful with the data analysis, so they will upgrade the customer.

3/20/2002 8:07:02 AM dbrooks Per bug meeting on 3/19.

3/4/2002 12:08:35 PM Betaprogram AMASOL: Thomas Dirsch, dirsch@amasol.de, +49 89 589 390311. I wanted to do an Ingres-to-Oracle migration by database load, but this feature seems not to be supported in Beta 4 (I got an error message about a non-existent Oracle path within the Ingres backup).

3/6/2002 1:45:32 PM shonaryar I called Thom but was not able to talk to him. He is at a conference in Europe and will not be back until Monday. I will mark the ticket as MoreInfo. saeed

3/11/2002 9:04:10 AM shonaryar I called Thomas again today; he was at the doctor and will not be available until Wednesday. saeed

3/11/2002 11:26:49 AM rhawkes This is probably user error -- it seems he's trying to load an Ingres database directly into Oracle. Saeed to confirm when the contact returns on Wed. 3/13.

3/13/2002 1:18:06 PM shonaryar I called him a few times today and was not able to get in touch with him. This seems like a user error: apparently he thought he could use nhLoadDb to migrate, which is not supported.
I am closing this ticket, unless I get another case. saeed

3/4/2002 12:29:03 PM Betaprogram ALCATEL CAN: Empowered Networks for Alcatel Canada: Jayde Hanley, jayde@empowerednetworks.com, 613-271-7971. DB issue created from ticket # 21360 per Donna Amaral. Melissa, please open a new ticket for this. I'll try to find someone who can help Jayde. Thanks, Donna -----Original Message----- From: Jayde Hanley [mailto:jayde@empowerednetworks.com] Sent: Monday, March 04, 2002 10:46 AM To: concord_beta@concord.com Cc: wlauer@concord.com; damaral@concord.com Subject: Ticket 21360

I received a workaround file from Will (system.licenseDefs.omx) to fix this problem. Before I received it though, I attempted to destroy and recreate the database, and then load a saved copy of the 5.5 database. The destroy and create completed successfully. When trying to load the db I received this error: $ nhLoadDb -p /database/beta/ehealth55.tdb The database you chose to load is from Machine name: ebeta Machine type: server Do you want to load the database? [n] y See log file /database/ehealth55/log/load.log for details... Begin processing 03/04/2002 10:20:19 AM. Cleaning out old files (03/04/2002 10:20:19 AM). Copying relevant files (03/04/2002 10:20:20 AM). Internal Error: Unable to connect to database 'EHEALTH' (ORA-01033: ORACLE initialization or shutdown in progress). (du/DuDatabase::dbConnect) Error: Database error: ERROR: SQLCODE=-3114 SQLTEXT=ORA-03114: not connected to ORACLE . Error: The program nhiLoadDb failed. Refer to log /database/beta/ehealth55.tdb/oracle_rman for more details.. Error: nhiLoadDb failed. $ Fatal Error: Assertion for 'db->isConnected ()' failed, exiting (in file ./SvrApp.C, line 243). I tried running nhStartDb and then loading the database again, with the same results. Also, I could no longer start the eHealth server. Doing so gave these results: $ nhServer start Starting eHealth servers.
$ Fatal Error: Assertion for 'db->isConnected ()' failed, exiting (in file ./SvrApp.C, line 243). Please advise as to how to proceed. Jayde 3/5/2002 2:13:50 PM wzingher This issue is being run as a fire drill right now. It is also being covered in ticket 21380 3/5/2002 4:33:13 PM rtrei 21440-- (Alcatel Canada) nhLoadDb failure My executive summary is that I am hoping this problem will turn out to be a rare situation, and that we can provide a workaround to Tech Support for when it does happen. This would allow us to put a fix (if needed) into the patch stream with proper testing, etc. However, at this point in time, we still do not know if this is a bug or not. We were hoping to manually test the workaround but the needed customer time was unavailable. We are getting the database ftp'd to us overnight, so hopefully we will have it in the morning to look at. Jayde called; he is ftp'ing the database to us as we speak, so it should be available in the morning. 3/8/2002 4:39:37 PM shonaryar When Jayde was trying to run debug on nhiDbServer + nhiCfgServer the system froze, so the file we got was incomplete. We asked him to set debug for nhiCfgServer only and re-run nhServer, but while he was waiting for nhServer to crash, he had to leave the building. The DB save he sent to me was created using gzip, which my system and his system had trouble unpacking, so I asked him to put it on CD and bring it in Monday; however, he had to leave the building and he said they didn't have time to do that. So on Monday Dean will try to compress the DB save and send it to us. We did get poller.cfg and it was consistent with the number of elements in the database. 
saeed 3/12/2002 9:08:56 AM shonaryar I was able to load the database when I used the procedure below. If RMAN-6025: no backup of log thread 1 seq xxxx scn xxxxxxx found to restore appears in rman_load3.log, you can work around it with this procedure: In the oracle_rman directory in the saveDb location, open the file rman_save.log and search for the last occurrence of "input archivelog thread"; it should be something like "input archivelog thread=1 sequence=6240 recid=1591 stamp=455036317". Then open the file nh_rman_restore3.scr and change the first line after { from something like set until scn 1944699; to set until logseq 6240 thread 1; Save the file. Source nethealthrc.csh and run the commands below: rman nocatalog RMAN> connect target; RMAN> @nh_rman_restore1.scr RMAN> @nh_rman_restore2.scr RMAN> @nh_rman_restore3.scr Log in to Oracle using sqlplus and try to do some transactions: SQLPLUS $NH_USER/$NH_USER SQL> select count(*) from nh_element_core; SQL> select table_name from user_tables; Run nhConvertDb as $NH_USER. I think we should check rman_save.log for the last logseq # and use this number to create the *.scr file for dbLoad instead of trying to calculate the last scn #. saeed 3/13/2002 2:20:00 PM rhawkes Ravi thinks this will occur one in 10,000 times. If they see this problem there is a workaround documented above. They can also save and load to correct the problem. For 5.5 we should doc this in the Tech Tips. 3/18/2002 9:10:53 AM rhawkes Changing state to Assigned, since this is a tech tip issue now. 3/27/2002 2:24:14 PM rhawkes Added to Tech Tips. 6/6/2002 9:12:22 AM rpattabhi This bug should not be marked fixed. We gave them a workaround; we still need to fix this in 5.6 -Ravi 6/18/2002 7:05:39 PM rpattabhi Checked in: ----------- Checked in "/vobs/top/wsCore/oracle" version "/main/octopus/7". Checked in "/vobs/top/wsCore/oracle/DbuLoadDbApp.C" version "/main/octopus/6". Checked in "/vobs/top/wsCore/oracle/dbuRman.H" version "/main/octopus/3". 
Checked in "/vobs/top/wsCore/oracle/dbuRman.pc" version "/main/octopus/3". Checked in "/vobs/top/wsCore/oracle/nhiLoadDb.C" version "/main/octopus/3". Checked in "/vobs/top/wsCore/oracle/sa_cre_phs.pc" version "/main/octopus/3". Checked in "/vobs/top/wsCore/oracle/sa_cre_tbl.pc" version "/main/octopus/3". Checked in "/vobs/top/wsCore/oracle/sa_dbs_utl.pc" version "/main/octopus/8". Checked in "/vobs/top/wsCore/oracle/sa_grt_prv.pc" version "/main/octopus/3". Checked in "/vobs/top/wsCore/oracle/sa_loa_dbs.pc" version "/main/octopus/2". Checked in "/vobs/top/wsCore/oracle/scripts" version "/main/octopus/5". To fix: ------- Bug#: 23011, 23005, 23004, 22930, 21440, 21025 Fix Summary: ------------ #23011 - sqlldr logfiles are not chmoded #23005 - pupbld.sql was not being run on NT, causing the product user profile error in oracle sqlplus #23004 - rman load db was overwriting the error object, causing the correct error to not show up on the console #22930 - migration needed fixes to clean up output as well as to get convert db to work since recent changes to add state checks - Also fixed migration so it no longer alters null columns to not null. This was taking 9hrs to run earlier. Avoided doing this by creating the tables with the constraints in place. #21440 - Alcatel Canada bug which caused their database not to recover because an archive log file was created between the time we created a logfile and the time we could get the scn number for the logfile. #21025 - nhClearDb name change in the vob to nhiClearDb. To test: -------- 1. Run nhLoadDb from an ingres save and verify migration load works. 2. Create a database on NT and verify sqlplus does not have PUPBLD error. 3. Run a machine to machine move with Rman load and save 4. Rerun performance test for migration and verify times match 5.5 times for 50K db. 
Compiled On: ------------ SOL NT Approver: Rich Hawkes Reviewer: Saeed Honaryar DI: NONE FI: NONE PI: Rerun performance test for migration and verify times match 5.5 times for 50K db. BI: NONE I18N: NONE 3/4/2002 5:24:43 PM knewman Unable to modify and save changes to poller --- Modify poller configuration GUI Server stops Error: server stopped unexpectedly restarting --- Manually restart server ---- Discovered devices - rediscovery of existing devices Internal Error: Expectation for '_dbid' failed (in file ../esdObject.C, line 365). (cu/cuAssert) --- Manually brought services down, ipcclean, brought up; still unable to save changes to poller config INGRES TRANSACTION LOG SIZE 300 Meg Number of Elements: 16931 requested nhResizeIngresLog 1200 ---- ran nhResizeIngresLog 1200 We are still having issues here. Below is the output of the error message from the console after making any modifications to the poller config or making a discovery. Tuesday, 02/19/2002 08:40:27 Internal Error (console) Expectation for '_dbld' failed (in file ../esdObject.C, line 365). (cu/cuAssert) Tuesday, 02/19/2002 08:41:23 User concord modified the poller configuration. (Log: /opt/ehealth/log/pollerAudit.02.19.2002.084122.log). Tuesday, 02/19/2002 08:42:36 Warning (Message Server) Attempt to release unlocked resource Poller Config (All). SUNLAN01::[50910 , 0000074e]: Mon Feb 18 16:18:45 2002 E_DMA00D_TOO_MANY_LOG_LOCKS This lock list cannot acquire any more logical locks. The lock list status is 00000000, and the lock request flags were 00000008. The lock list currently holds 700 logical locks, and the maximum number of locks allowed is 700. The configuration parameter controlling this resource is ii.*.rcp.lock.per_tx_limit. SUNLAN01::[50910 , 0000074e]: Mon Feb 18 16:18:45 2002 E_DM004B_LOCK_QUOTA_EXCEEDED Lock quota exceeded. --- Customer brought services down, checked shmem/sem, edited config.dat (rcp.lock.per_tx_limit. 
Increase this number to 1200) and brought services back up Files on bafs: DbCollect systemSpecs Advanced logging MsgServer NhiConsole 3/7/2002 2:55:32 PM rrick Spoke with Mike: - No more messages in errlog.log regarding LOCKS - He can modify and save things in the poller.cfg with no problems - All discoveries are working fine - Closing ticket. 3/5/2002 3:36:45 PM Betaprogram STATE FARM: -Submitted by Mike Loewenthal, the Beta AE- Craig Cies craig.cies.h2o6@statefarm.com (309) 763-4561 At the end of the nhLoadPolledData.sh script, the INDEX.DDL did not load/execute properly. Contacted Saeed Honaryar, who said he knew of this problem and that it was fixed in Beta 5. He has created a fix which we will try. 3/6/2002 8:58:32 AM shonaryar This bug is fixed in beta5; I emailed him a file with a workaround. saeed 3/27/2002 2:44:43 PM Betaprogram Site verified fix. 3/7/2002 10:21:18 AM jpoblete Customer: Teleglobe Statistics Rollup fails with the following error: ----- Job started by User at '03/07/2002 02:06:33 AM'. ----- ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing 03/07/2002 02:06:33 AM. Error: Sql Error occured during operation. ----- Scheduled Job ended at '03/07/2002 02:16:04 AM'. ----- The advanced logging for the process shows the following: 03/05/02 16:09:53 [d,du ] End transaction level 1 03/05/02 16:09:53 [d,cdb ] End rollup of stats type ST, stage 1 03/05/02 16:09:53 [d,cdb ] Rolling up stats data level 2. 03/05/02 16:09:53 [d,cdb ] Beginning rollup of stats type ST, stage 2 03/05/02 16:09:53 [d,cdb ] Selecting range of nh_rlp_boundary, type ST, stages 2 - 2. 03/05/02 16:09:53 [d,du ] Begin transaction level 1 03/05/02 16:09:53 [Z,du ] sqlca.sqlcode: 0 03/05/02 16:09:53 [Z,du ] rows: 1 03/05/02 16:09:53 [d,cdb ] Done selecting range of nh_rlp_boundary, type ST, stages 2 - 2. 03/05/02 16:09:53 [d,du ] Committing database transaction ... 03/05/02 16:09:53 [d,du ] Committed. 
03/05/02 16:09:53 [d,du ] End transaction level 1 03/05/02 16:09:53 [d,cdb ] End rollup of stats type ST, stage 2 03/05/02 16:09:53 [d,cdb ] Removing all deleted stats elements. 03/05/02 16:09:53 [d,du ] Begin transaction level 1 03/05/02 16:09:53 [Z,du ] sqlca.sqlcode: 0 03/05/02 16:09:53 [Z,du ] rows: 1 03/05/02 16:09:53 [z,cu ] returning env var = 'TMPDIR' for type = 5 03/05/02 16:09:53 [z,cu ] returning env var val = '/usr/local/neth/tmp' for type = 5 03/05/02 16:09:53 [Z,du ] sqlca.sqlcode: 0 03/05/02 16:09:53 [Z,du ] rows: 1 03/05/02 16:09:53 [Z,du ] sqlca.sqlcode: 100 03/05/02 16:09:53 [Z,du ] rows: 0 03/05/02 16:09:53 [Z,du ] sqlErrorCode: 100 03/05/02 16:09:53 [Z,du ] sqlErrorText: 03/05/02 16:09:53 [Z,du ] sqlErrorMsg : 100, 03/05/02 16:09:53 [d,du ] Rolling back database transaction. 03/05/02 16:09:53 [d,du ] End transaction level 1 03/05/02 16:09:53 [d,cba ] Exit requested with status = 1 03/05/02 16:09:53 [d,cba ] Exiting ... 03/05/02 16:09:53 [d,du ] Disconnecting from db: ehealth, user: nethealth, handle: [0xffbef518] ... 03/05/02 16:09:53 [d,du ] Disconnected. It looks like there is some problem initializing the nh_rlp_boundary table. The whole advanced logging is in the call ticket directory: \\BAFS\Escalated Tickets\61000\61086 3/15/2002 11:08:25 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, March 14, 2002 4:54 PM To: Trei, Robin Cc: Gray, David Subject: Rollups Fail (silently) 61427 , PT # 21527 Robin, Problem: Solaris 5.0.2 p01. Rollups fail with sql error during operation, no table mentioned. We will need to escalate as two customers have seen this, BOTH with distributed reporting. Attached is a printqry and advanced logging for the failure: 3/18/2002 10:16:59 AM wburke -----Original Message----- From: Bui, Ha Sent: Monday, March 18, 2002 10:06 AM To: Poblete, Jose; Burke, Walter Subject: ticket 21527 Importance: High Hi, I have this ticket related to a statistics rollup problem. 
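The consistency check Ha asks for below boils down to a set difference between the element_id columns of nh_deleted_element_core and nh_deleted_element_aux. Outside the database, the same logic can be sketched with comm on sorted id lists exported from the two tables; the ids here are made up for illustration.

```shell
# Set-difference sketch of the nh_deleted_element_core / nh_deleted_element_aux
# check. In practice the two files would be sorted element_id lists exported
# from the database; the ids below are invented.
core_ids=$(mktemp); aux_ids=$(mktemp)
printf '1000100\n1000101\n1000102\n1000105\n' > "$core_ids"
printf '1000100\n1000102\n' > "$aux_ids"

# ids in core but not in aux -- the rows that make the rollup's delete fail
missing_in_aux=$(comm -23 "$core_ids" "$aux_ids")
# ids in aux but not in core
missing_in_core=$(comm -13 "$core_ids" "$aux_ids")

echo "$missing_in_aux"
rm -f "$core_ids" "$aux_ids"
```

Any id that shows up in either difference corresponds to a row returned by one of the two NOT IN queries, which is exactly the condition the dummy-record workaround is meant to repair.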
I looked through the log and it looked like the tables nh_deleted_element_core and nh_deleted_element_aux didn't have the same elements. The sql statement failed when the program tried to delete "deleted elements". I need either one of you to confirm this for me. Could either one of you please contact the customer and ask them to verify that? The way to do that is to sql into the database, then do 1) select element_id from nh_deleted_element_core where element_id not in (select element_id from nh_deleted_element_aux) 2) select element_id from nh_deleted_element_aux where element_id not in (select element_id from nh_deleted_element_core) If 1) or 2) returns some rows, we need to insert some dummy data into the table with the same element_id so that the rollup can continue. Thanks and let me know how it turns out. 3/18/2002 11:02:44 AM wburke -----Original Message----- From: Burke, Walter Sent: Monday, March 18, 2002 10:51 AM To: Bui, Ha; Poblete, Jose Subject: RE: ticket 21527 1) 105 rows returned on statement 1. 2) 0 rows returned on Statement 2. -Walter 3/18/2002 1:39:09 PM wburke -----Original Message----- From: Bui, Ha Sent: Monday, March 18, 2002 1:27 PM To: Burke, Walter Subject: RE: ticket 21527 Walter, Could you ask the customer to send me their database (if it's small), or save the tables nh_element_core, nh_element_aux, nh_deleted_element_core, nh_deleted_element_aux and send them to me. For now, they can create dummy records in nh_deleted_element_aux so that statistics rollup can go on. They can do that manually or they can use this script to insert dummy records. Do you know if they have a distributed polling environment? I need to figure out why those two tables (deleted ones) have inconsistent data. If they have deleted the elements through the console, I don't believe that the database would be in this state. 3/18/2002 3:43:45 PM jpoblete -----Original Message----- From: Poblete, Jose Sent: Monday, March 18, 2002 3:32 PM To: Bui, Ha Subject: 21527 Hi Ha! 
This is the output of your queries in the following order: - select element_id from nh_deleted_element_core where element_id not in (select element_id from nh_deleted_element_aux)\g (wait to get the result of the query.) - select element_id from nh_deleted_element_aux where element_id not in (select element_id from nh_deleted_element_core)\g Hope this helps -JMP (sent the output attached) The file is also in the call ticket directory \Mar18 3/20/2002 3:43:46 PM hbui Problem: nhFetchDb, nhRemoteSaveDb do not save and fetch data in nh_deleted_element_aux. Meanwhile, nhRollup deletes data from nh_deleted_element_core and nh_deleted_element_aux at the same time. If it can't find the same deleted elements in both tables, it won't proceed further. Gave Walter and Jose the script to work around the problem. The script puts dummy data into nh_deleted_element_aux for all entries in nh_deleted_element_core that have not been inserted into nh_deleted_element_aux. Thus, the rollup will go on. The fix shall be put in patch3. 3/21/2002 10:42:34 AM rhawkes The fix apparently didn't work. This is critical to address ASAP. 3/22/2002 9:46:22 AM hbui Gave Walter the script to put the dummy data into nh_deleted_element_aux. The database got into an inconsistent state. However, the inconsistency was caused by another indexing problem. Yulun is working on fixing the indexing problem. After that we shall merge the fixes. 3/25/2002 10:38:06 AM hbui Posted one-off 4/2/2002 3:42:04 PM rkeville Customer is still having sql errors during rollup, according to the AE on-site. -----Original Message----- From: Keith Scott [mailto:kscott@concord.com] Sent: Tuesday, April 02, 2002 3:20 PM To: Bob Keville Subject: BOA Sql Error Bob, The Sql Error is currently in the /opt/health/log/Statistics_Rollup.100000.log Apr 1, 2002. 
This is it verbatim: Job started by user at 04/01/2002 20:00:53 ----- $NH_HOME/bin/sys/nhiRollupDb -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing 04/01/2002 20:00:54 Error: Sql Error occured during operation. ----- Scheduled Job ended at 04/01/2002 20:00:59 ----- (EOF): Keith Scott 4/4/2002 5:47:22 PM rkeville -----Original Message----- From: Keville, Bob Sent: Thursday, April 04, 2002 5:36 PM To: Bui, Ha Cc: Keville, Bob Subject: RE: 21527 - nhiStatsRollup Hi Ha, I had them run a script to get the output from the nh_deleted_element_aux and nh_deleted_element_core tables; there is a mismatch between them. Please see the script and the output file in the Apr04 directory in this call ticket's dir. Cheers, -Bob ####################################################### 4/9/2002 11:13:40 AM hbui Sent Bob Keville an email to ask for the advanced log, and to run the script to add dummy data to the nh_deleted_element_aux table so that the nhStatsRollup can go on. Haven't heard anything from them yet. 4/16/2002 10:59:30 AM dbrooks Should be in MoreInfo. 4/16/2002 4:31:32 PM rkeville Wednesday, April 10, 2002 10:30:00 AM rkeville Called and left a message for Bryan asking him to call me back. ################################################################ Thursday, April 11, 2002 2:37:59 PM rkeville Called and left a message asking him to call me back. ############################################################ Thursday, April 11, 2002 2:39:10 PM rkeville -----Original Message----- From: Keville, Bob Sent: Thursday, April 11, 2002 2:28 PM To: 'bryan.mitchell@bankofamerica.com'; Keville, Bob Subject: Open tickets with Concord Hi Bryan, How are we doing with getting the debug for the issues we have open for you folks? 
Thanks, -Bob #################################################################### Tuesday, April 16, 2002 10:59:30 AM dbrooks Status update from associated Problem Ticket #ProbT0000021527 ==> MoreInfo Tuesday, April 16, 2002 4:29:37 PM rkeville Called and left a message for client asking him for the debug and to please call me to work on these issues. #################################################################### 4/17/2002 3:06:17 PM mgenest Associating call ticket #62277 4/18/2002 10:49:14 AM dbrooks Change to field test per escalated ticket meeting. 4/22/2002 11:54:27 AM wburke -----Original Message----- From: Burke, Walter Sent: Monday, April 22, 2002 11:43 AM To: Bui, Ha Subject: FW: Ticket # 61086 - Rollup Failure - PT # 21527 Ha, It appears that insertDummy.sh works ok, but rollups fail again after the next fetch. Once again, I have discrepancies between the nh_deleted_element_core and aux. Is the work-around to keep running said script? 4/24/2002 9:51:40 AM mwickham Can the one-off be applied to all customers associated with this problem ticket, or is it specifically built for Walter's customer, 61086? 4/25/2002 2:34:56 PM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, April 25, 2002 2:23 PM To: Bui, Ha Subject: Rollup Failure - PT # 21527 Hi Ha, I have attached call ticket 62277 to this bug. This customer has a remote polling environment and his rollups are failing with the sql error. Error: Sql Error occured during operation. (dbu/CdbTblElement::removedDeleteStsElement) I noticed that ticket 61086 ran a script "dummy.sh" that took care of the rollup issue until the fetch ran. Should this customer run the script? If so, should they keep running the script after the fetches as a temporary workaround? Thanks, Mike 4/25/2002 2:47:41 PM hbui Walter, I will put in another one-off tonight for 502P2. The one-off I gave you before only included the executable nhiRollupDb. 
It needs to have the associated library (libCciWscDbP.so), which it didn't have. When you have the new one-off, could you please ask the customer to run the script I sent you before they run the remote save and the fetch on all sites? The script will make the entries in nh_deleted_element_core and nh_deleted_element_aux match on all sites. Then, have them run nhRemoteSaveDb and nhFetchDb, and run a sql statement to check the entries in those tables. If they still have the problem, could you have them run nhRemoteSaveDb and nhFetchDb with debug and log the outputs for me? For them to go on with the rollup, they just need to run the script after each fetch. Thank you. _Ha Mike, the one-off and the script will work for any customer with this problem (they need to have patch2, though). 4/26/2002 10:00:23 AM hbui Posted the one-off that includes nhFetchDb.sh, nhRemoteSaveDb, libCciWscDbP.so, and nhiRollupDb. nhiRollupDb shall clean up nh_deleted_element_core and nh_deleted_element_aux as before. However, if it doesn't find the elements in nh_deleted_element_aux that are in nh_deleted_element_core, it shall continue instead of raising an error. nhFetchDb and nhRemoteSaveDb shall save/fetch nh_deleted_element_aux. 4/26/2002 10:13:34 AM wburke -----Original Message----- From: Burke, Walter Sent: Friday, April 26, 2002 10:01 AM To: 'virginia.bateman@teleglobe.com' Subject: Ticket # 61086 - Rollup Failure Hi Virginia, I have a permanent fix for this issue. What was happening was that the fetch caused discrepancies between the deleted_core and deleted_aux tables, which in turn caused the rollups to fail. I have placed the one-off in ftp.concord.com/outgoing/teleglobe. README file for the customer one-off to ProbT# 19887 ================================================================================ Version: 5.0.2 Patch 2 Problem: Statistics Rollup fails in distributed polling environment. 
The rollup fails at the central site because deleted elements from remote sites fail to be deleted after fetching databases. Instructions: At each site of the distributed polling environment: 1. Save the files that will be replaced. copy NH_HOME/bin/nhRemoteSaveDb to NH_HOME/bin/nhRemoteSaveDb.sav copy NH_HOME/bin/nhFetchDb to NH_HOME/bin/nhFetchDb.sav copy NH_HOME/bin/sys/nhiRollupDb to NH_HOME/bin/sys/nhiRollupDb.sav copy NH_HOME/lib/libCciWscDbP.so to NH_HOME/lib/libCciWscDbP.so.sav 2. Untar and copy nhRemoteSaveDb, nhFetchDb into NH_HOME/bin/, nhiRollupDb into NH_HOME/bin/sys, libCciWscDbP.so into NH_HOME/lib Make sure the file ownership and protection of the replaced files are maintained. 5/1/2002 11:36:02 AM rkeville Sent one-off to BOA. 5/2/2002 5:37:09 PM wburke It turns out we were using the wrong nhFetchDb script on the central. Fixed this and fetch ran OK. 5/3/2002 2:24:36 PM wburke Telecom Brasilia is good. Passed field test. 5/17/2002 10:29:04 AM rkeville Bank of America is still having rollup failures after installing the one-off; requested the rollup log. ##### 5/20/2002 10:11:24 AM hbui Hi Bob, Could you ask the customer for data in nh_rlp_boundary, nhv_rlp_tables, nhv_stats_tables, and run the nhiRollup with debug, the list of nh_stat* files (or a screen shot of all the tables they have in the database). Thanks, _Ha 5/20/2002 2:35:59 PM rtrei Please let me know the results one way or another. Have a good weekend. -----Original Message----- From: Keville, Bob Sent: Friday, May 17, 2002 4:54 PM To: Trei, Robin Subject: RE: errorlog.log Cool, thanks~! Sorry to trouble you. -Bob -----Original Message----- From: Trei, Robin Sent: Friday, May 17, 2002 4:53 PM To: Keville, Bob Subject: RE: errorlog.log Bob-- This looks like the problem (see below). Since the problem table is an index, it might be easiest to drop the index and then let the index stats job recreate it. 
You may need to drop it using verifyDb ********************* central_::[ingres , 0000039a]: Fri May 17 09:27:46 2002 E_CL2530_CS_PARAM sec_label_cache = 100 CENTRAL_::[33066 , 00000001]: Fri May 17 09:27:46 2002 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (su4.us5/00) Server -- Normal Startup. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM93A7_BAD_FILE_PAGE_ADDR Page 17 in table nh_stats0_1020967199_ix1, owner: ehealth, database: ehealth, has an incorrect page number: 0. Other page fields: page_stat 00000000, page_log_address (00000000,00000000), page_tran_id (0000000000000000). Corrupted page cannot be read into the server cache. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM92CB_DM1P_ERROR_INFO An error occurred while using the Space Management Scheme on table: nh_stats0_1020967199_ix1, database: ehealth CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9206_BM_BAD_PAGE_NUMBER Page number on page doesn't match its location. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM930C_DM0P_PAGE_CHKSUM_FAIL Page Checksum failure detected. Database ehealth, Table nh_stats0_1020967199_ix1, Page 2. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9C83_DM0P_CACHEFIX_PAGE An error occurred while fixing a page in the buffer manager. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9204_BM_FIX_PAGE_ERROR Error fixing a page. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM920C_BM_BAD_FAULT_PAGE Error faulting a page. 
CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9263_DM1B_REPLACE Error occurred replacing a record. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM9261_DM1B_GET Error occurred getting a record. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM904C_ERROR_GETTING_RECORD Error getting a record from database:ehealth, owner:ehealth, table:nh_stats0_1020967199_ix1. CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_DM008A_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) CENTRAL_::[33066 , 0000016f]: Fri May 17 14:51:12 2002 E_QE007C_ERROR_GETTING_RECORD Error trying to get a record. Associated error messages which provide more detailed information about the problem can be found in the error log (errlog.log) -----Original Message----- From: Keville, Bob Sent: Friday, May 17, 2002 4:49 PM To: Trei, Robin Subject: errorlog.log Robin, Here is the errlog.log you requested. -Bob -----Original Message----- From: bryan.mitchell@bankofamerica.com [mailto:bryan.mitchell@bankofamerica.com] Sent: Friday, May 17, 2002 4:32 PM To: Keville, Bob Subject: files (See attached file: errlog.log) 5/20/2002 2:40:23 PM rtrei I just wanted to make sure we were all on the same page. I looked at this problem with Bob K on Friday afternoon. It looked like a corrupt table problem to me. They may very well have run out of their work or temporary space (for sorting, etc.) I recommended Bob drop the index. If that doesn't do it, then the table may need to be dropped as well. I forgot to update the Remedy ticket, which has caused Ha to lose some time re-investigating the problem. My apologies for that all around. I have since updated the ticket. 
I saw Ha entered the following request in the ticket: ************* Hi Bob, Could you ask the customer for data in nh_rlp_boundary, nhv_rlp_tables, nhv_stats_tables, and run the nhiRollup with debug, the list of nh_stat* files (or a screen shot of all the tables they have in the database). Thanks, _Ha *************** At this point, I recommend that you continue with my suggested actions for now, holding Ha's as a followup exercise if needed. 5/22/2002 10:34:18 AM dbrooks Closed per escalated ticket meeting 5/22. 3/8/2002 11:27:36 AM Betaprogram beta 5 kit: 3/6 I noticed a core file in the ehealth directory on eh55-atl. I ran file on the core and it said that it was from Data Analysis, so I looked in the system.log and found the following: Friday, March 08, 2002 01:00:39 AM Pgm nhiDbServer: Starting job 'Data Analysis' . . . (Job id: 100002, Process id: 1911). Friday, March 08, 2002 01:00:44 AM Pgm nhiDbServer: Command '/eh55b3/bin/sys/nhiStdReport -distMode slave -masterPgmId 131 -masterPid 11639.38 -tz est5edt -masterHost sander -dsiFileIn $(attachPathname1)' run from cluster member sander. Friday, March 08, 2002 01:00:44 AM Pgm nhiDbServer: Command '/eh55b3/bin/sys/nhiStdReport -distMode slave -masterPgmId 126 -masterPid 11638.38 -tz est5edt -masterHost sander -dsiFileIn $(attachPathname1)' run from cluster member sander. Friday, March 08, 2002 01:00:49 AM Error nhiRmtOut Pgm nhiRmtOut: Rcs timed out waiting for an ack for message 'CuStsIpcMsg'. Current state is 'needAck' Friday, March 08, 2002 01:04:17 AM Pgm nhiDbServer: Job step 'Data Analysis' failed (the error output was written to /eh55b3/log/Data_Analysis.100002.log Job id: 100002). Friday, March 08, 2002 01:04:17 AM Pgm nhiDbServer: Job 'Data Analysis' finished (Job id: 100002, Process id: 1911). I then looked at the /eh55b3/log/Data_Analysis.100002.log file mentioned in the system log. 
It contained the following: ----- Job started by User at '03/08/2002 01:00:39 AM' ----- ----- $NH_HOME/bin/sys/nhiDataAnalysis -u $NH_USER -d $NH_RDBMS_NAME ----- Begin processing 03/08/2002 01:00:40 AM. Fatal Error: Assertion for 'Bad' failed, exiting (Table creation : nh_daily_exceptions_1000003 failed! in file ./CdbTblRptConfig.C, line 821). ----- Scheduled Job ended at '03/08/2002 01:04:17 AM'. ----- 3/8/2002 2:37:23 PM Betaprogram The above was due to the partition that contained Oracle running out of space. So, the Data Analysis failing isn't an issue. The question is: what, if anything, should eHealth do in this case? This ticket will follow that decision. 3/11/2002 11:20:15 AM rhawkes Robin to review with Joel; probable doc issue. 3/12/2002 11:47:03 AM rhawkes I discussed this with Joel. He is somewhat concerned that the result of running out of disk space is a core dump, but the fact that it's a "controlled" core dump (caused by an assertion failure) makes that acceptable. Customers know that to avoid this they should run SysEdge, so there is no action to take for this ticket. 3/8/2002 12:53:30 PM wburke Thursday, March 07, 2002 02:35:18 PM Error (nhiPoller[Import]) Unable to execute 'set lockmode on nh_dlg0_1015534799 where level=table' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Thu Mar 7 14:35:18 2002) ). Thursday, March 07, 2002 02:35:22 PM Error (nhiPoller[Import]) Unable to execute 'set lockmode on nh_dlg0_1015534799 where level=table' (E_RD0060 Cannot access table information due to a non-recoverable DMT_SHOW error (Thu Mar 7 14:35:22 2002) Monday, March 04, 2002 03:08:06 PM Error (nhiPoller[Dlg]) Unable to execute 'MODIFY nh_dlg0_1015275599 TO BTREE UNIQUE ON sample_time, nap_id, dlg_src_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_SC0206 An internal error prevents further processing of this query. 
Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log (Mon Mar 4 15:08:04 2002) ). Failed to commit database transaction (E_LQ002D Association to the dbms has failed. This session should be disconnected.). Massive problems at Vanguard. See also 59321. All seem to be either TA or Netflow related. 4.8 P08 3,000 Stats Elements 7 probes 45,000 Address Rows 2.2 Million Node Address Pairs on an e250, 2 CPUs @ 400 MHz, 1 GB RAM, 2 GB swap BPM = 500 UNREF = 3/8/2002 4:43:41 PM yzhang Run the following verifydb drop table. Before running the drop table, you need to check whether the physical files exist for those two tables by running select file_name from iifile_info where table_name = 'nh_dlg0_1015534799' select file_name from iifile_info where table_name = 'nh_dlg0_1015275599' If the physical files exist, run the following verifydb; if they do not exist, you need to touch the files, then run the verifydb drop table. verifydb -mrun -sdbname nethealth -odrop_table nh_dlg0_1015534799 verifydb -mrun -sdbname nethealth -odrop_table nh_dlg0_1015275599 Let me know if this is not clear. Yulun 3/8/2002 4:49:00 PM wburke -----Original Message----- From: Burke, Walter Sent: Friday, March 08, 2002 4:38 PM To: Zhang, Yulun Subject: RE: 21571/61914 Already done: New Error: Friday, March 08, 2002 01:51:04 PM Error (nhImportData) DDI import timed-out waiting for the poller. Friday, March 08, 2002 02:06:08 PM Error (nhImportData) DDI import timed-out waiting for the poller. Friday, March 08, 2002 02:21:08 PM Error (nhImportData) DDI import timed-out waiting for the poller. Friday, March 08, 2002 02:36:22 PM Error (nhImportData) DDI import timed-out waiting for the poller. Friday, March 08, 2002 02:51:36 PM Error (nhImportData) DDI import timed-out waiting for the poller. 
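Yulun's verifydb workaround above has a small precondition step that is easy to script: for each orphaned table, make sure the physical data file exists (touching it if it does not) before the drop is attempted. A sketch follows; the data directory and file names here are invented stand-ins for what the iifile_info queries would return.

```shell
# Sketch of the pre-verifydb check: ensure each table's physical file
# exists, touching any that are missing. The directory and file names are
# made up; real names come from the iifile_info queries in the ticket.
datadir=$(mktemp -d)
touch "$datadir/aaaaaaaa.t00"          # pretend one of the two files survived
touched=0
for f in aaaaaaaa.t00 aaaaaaab.t00; do
    if [ ! -f "$datadir/$f" ]; then
        touch "$datadir/$f"
        touched=$((touched + 1))
    fi
done
# With every file in place, each orphaned table can then be dropped:
#   verifydb -mrun -sdbname nethealth -odrop_table nh_dlg0_1015534799
#   verifydb -mrun -sdbname nethealth -odrop_table nh_dlg0_1015275599
echo "$touched file(s) touched"
rm -rf "$datadir"
```

The touch matters because verifydb's drop expects the file to be present; creating an empty one lets the drop clean up the catalog entry instead of failing again.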
3/11/2002 11:28:02 AM wburke -----Original Message----- From: Yi_Jia_Zhang@vanguard.com [mailto:Yi_Jia_Zhang@vanguard.com] Sent: Monday, March 11, 2002 10:11 AM To: Burke, Walter Subject: Re: Ticket # 61064 - importTimeOut Hello Walter, The Import and Conversations pollers no longer hung after we deleted the original Ingres_Log file and created a new one. Stratacom imports also resumed normally after 7:15pm on Friday. As discussed on the phone, this ticket may be similar to a lot of the other tickets before, in the sense that all the original errors (trying to get Database Status from the GUI led to database deadlock, BTREE, failure to commit transaction...) were due to some type of query to the database. Has Engineering been able to come up with an explanation and/or resolution? 3/12/2002 12:24:39 PM wburke -----Original Message----- From: Yi_Jia_Zhang@vanguard.com [mailto:Yi_Jia_Zhang@vanguard.com] Sent: Tuesday, March 12, 2002 12:12 PM To: Burke, Walter Cc: rrick@concord.com Subject: RE: Ticket # 61014 - Unable to modify table Walter, Just got a couple of very similar errors in System Messages. No ingres logs seem to have info on this. Tuesday, March 12, 2002 12:03:37 PM Error (nhiPoller[Import]) Unable to execute 'MODIFY nh_dlg0_1015952399 TO BTREE UNIQUE ON sample_time, nap_id, dlg_src_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Tue Mar 12 12:03:35 2002) ). Tuesday, March 12, 2002 12:07:06 PM Error (nhiPoller[Import]) Unable to execute 'MODIFY nh_dlg0_1015952399 TO BTREE UNIQUE ON sample_time, nap_id, dlg_src_id, proto_id WITH FILLFACTOR = 100, LEAFFILL = 100, NONLEAFFILL = 100' (E_US1591 MODIFY: table could not be modified because rows contain duplicate keys. (Tue Mar 12 12:07:04 2002) ). 
(See attached file: sys0312.log) 3/12/2002 5:11:15 PM wburke -----Original Message----- From: Burke, Walter Sent: Tuesday, March 12, 2002 5:01 PM To: 'Yi_Jia_Zhang@vanguard.com' Cc: Rick, Russell; Jarvis, Rob Subject: Ticket # 61014 YiJia, I spoke with Russ Rick on this and other related issues that Vanguard has been having: Given that Vanguard is set up as follows: 4.8 P08 3,000 Stats Elements 7 probes 45,000 Address Rows 2.2 Million Node Address Pairs Import Polling BOTH CWM and Netflow on a e250 2 cpus @ 400 mhz 1 gB ram 2 Gb Swap There is a clear problem with providing enough physical and swap memory for this type of set-up. Until at least an additional Gb of RAM is added to the machine, or the machine is split into two separate pollers, we will continue to have issues regarding the stability of the machine. 3/21/2002 10:53:43 AM yzhang Talked with Walter; he suggested making this nobug. 3/8/2002 2:48:26 PM Betaprogram beta 5 kit. 3/6 Oracle ran out of space and eHealth started having problems. After we removed some files, oracle recovered fairly nicely and we were once again able to connect to oracle and start ehealth. However, now we are getting the following errors on the console, every time the stats poller polls. Friday, March 08, 2002 12:45:45 PM Error (Statistics poller:) Sql Error occured during operation (ORA-00942: table or view does not exist ). Friday, March 08, 2002 12:45:45 PM Error (Statistics poller): Unable to add 'network element' data to the database, dropping this poll. 3/11/2002 11:25:19 AM rhawkes We need to do an nhRollupDb and bounce the servers. Robin will write up a Tech Tip for Support on this. 3/27/2002 2:29:14 PM rhawkes Added to Tech Tips document. 3/8/2002 3:02:20 PM Betaprogram Logged by Cindy beta 5: 3/6 eh55-atl server After our recovery from oracle running out of space, I'm receiving the following error while trying to run reports (aat, trend) from the web. 
(the error also occurs from the console, but you don't see this part) Web: assyrian-enet-port-1_2002_03_08_14_43_28_175 Error: Unable to execute 'INSERT INTO SESSION_cdbTempTable12812 SELECT * FROM nh_stats0_1015599599 WHERE element_id = 1000116' (ORA-00942: table or view does not exist). 3/8/2002 3:45:36 PM cboland Jay took a quick peek at this and said it's the stats table that doesn't exist. 3/11/2002 11:25:51 AM rhawkes Duplicate of 21583. 3/8/2002 7:30:40 PM apier This Problem Ticket was spawned from 15738 after a meeting between Engineering (Larry S, Santosh B, Robin T, Matt B) and Support (Tony P) on 3/8/2002. The problems are similar, as they appear to be caused by corrupt or stale DB cache. This new ticket will be used to track the Assertion error and the original case will be used to track the Expectation error that appears in the text below. The new Call Ticket associated is 61330. 15738 will be de-escalated and this case will be escalated. ############################################################################################################# Servers crashed during the import from CWM I was working on Call Ticket # 59110. We were attempting to perform an nhSvGetConfig for about 8000 CWM elements. Important facts: 1) Error messages in the console - Friday, March 01, 2002 04:37:20 PM Error nhiCfgServer Pgm nhiCfgServer: Unable to find 'yrk-ram1-b.inet-RH-Cpu-1' (the name '' does not exist). Friday, March 01, 2002 04:37:35 PM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Friday, March 01, 2002 04:37:54 PM System Event nhiCfgServer Server started successfully. Friday, March 01, 2002 04:37:58 PM System Event nhiConsole Console initialization complete. Friday, March 01, 2002 04:38:00 PM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. 
2) The merge uses a DCI rule file that calls on grouping functionality Performed an nhConfig with a -verify option without groupings - builds a DCI file no problem Performed an nhConfig with a -verify option with grouping and get errors: Expectation for '_dbId' failed (in file ../esdObject.C, line 365). This is the same as 15738 - need to escalate that one now. =========================================================================== Performed an nhConfig with a -verify option without grouping DCI rule file - builds a DCI file no problem Performed an nhConfig with a -verify option with grouping DCI file and get errors: Expectation for '_dbId' failed (in file ../esdObject.C, line 365). ######################################################################## Santosh has been able to reproduce the Assertion for 'elemPtr' failed error and believes that fixing this problem will correct the Expectation for '_dbId' failed Thursday, March 07, 2002 11:12:36 AM sbalagopalan Attached are the dci files needed to reproduce the problem. Here are the steps: 1) Start with clean db and clean poller.cfg (delete all elements via pollerUI). 
2) nhConfig -dciIn indexShiftUp_cfg.dci (imports six elements) 3) nhConfig -dciIn indexShiftUp.cfg.new.dci (imports eight elements of which two are new) Step 3 (2nd import) should produce assertion in CfgServer::setDbIds() //File indexShiftUp_cfg.dci IR,2,use:testing,0,03/06/2002,14:53:44, FT,GlobalInfo,symbol,symbol,string,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol FN,GlobalInfo,nmsSrc,host,addrPattern,addrFile,exclFile,community,findMib2,createNew,mergeSource,commit,modes,deletePaths FT,Elements,symbol,symbol,symbol,string,symbol,string,string,string,string,string,double,double,symbol,symbol,ipAddr,symbol,symbol,symbol,string,symbol,symbol,symbol,symbol,symbol,string,symbol,ipAddr,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol,double,double,string,symbol,string,string,symbol,symbol,string,string,string,symbol,symbol,symbol,ipAddr,symbol,string,integer,string,string,integer,integer,string,integer,symbol FN,Elements,objId,name,dbId,nmsId,poll,sysContact,sysDescription,sysName,sysLocation,ifDescription,speedIn,speedOut,cktEndpoint,mibTranslationFile,ipAddr,readCommunity,writeCommunity,storeInDb,ifType,uniqueDevId,index1,index2,index3,index4,possibleLatencySources,latencySource,latencyPartner,nmsSource,pollRate,alias,isNameUserSupplied,discoverMtf,nmsName,nmsState,dataSourceCapabilities,protocolCfgSymbol,deviceSpeedIn,deviceSpeedOut,copyOf,enterpriseId,processGroupName,processName,aggregateAvailability,argsRequired,processArgs,sideA,sideZ,timeZone,monitorLiveEx,ifPhysAddress,ifIpAddress,appType,appKey,responseLimit,discoverKey,caption,fullDuplex,incInLwRpts,userString,machineId,clientAccess FT,ElementGroups,symbol,symbol,symbol FN,ElementGroups,objId,groupName,elementObjId FT,Groups,symbol,symbol,symbol FN,Groups,groupId,name,groupType FT,GroupContents,symbol,symbol,symbol FN,GroupContents,rowId,groupId,elementId FT,GroupLists,symbol,symbol,symbol FN,GroupLists,groupListId,name,groupType FT,GroupListContents,symbol,symbol,symbol 
FN,GroupListContents,rowId,groupListId,groupId FT,Parents,symbol,symbol,symbol FN,Parents,objId,elementObjId,parentObjId FT,Associations,symbol,symbol,symbol,integer FN,Associations,objId,originElemObjId,destElemObjId,assocType FT,DiscoverAddrs,ipAddr FN,DiscoverAddrs,addr FT,DiscoverPorts,integer FN,DiscoverPorts,port FT,Operations,symbol,symbol,symbol,symbol FN,Operations,operator,section,objId1,objId2 DS,,Elements, 0,"Keyed-D1",0,,Yes,,,"Keyed-D1",,"Core Router Stats",,,,cisco-rh-rtr.mtf,192.124.15.1,private,,,"Router",UDID1,0,"",,,,,,NH:Discover,,,,cisco-rh-rtr.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 1,"Keyed-D1C1",0,"Keyed-D1 enet-port IFD-1",Yes,,,"Keyed-D1",,"IFD-1",,,,ciscoMib2-lan.mtf,192.124.15.1,private,,,"ethernet-csmacd",UDID1,1,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 2,"Keyed-D1C3",0,"Keyed-D1 enet-port IFD-3",Yes,,,"Keyed-D1",,"IFD-3",,,,ciscoMib2-lan.mtf,192.124.15.1,private,,,"ethernet-csmacd",UDID1,2,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 3,"Unkeyed-D1",0,,Yes,,,"Unkeyed-D1",,"Core Router Stats",,,,cisco-rh-rtr.mtf,192.124.15.2,private,,,"Router",UDID2,0,"",,,,,,NH:Discover,,,,cisco-rh-rtr.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 4,"Unkeyed-D1C1",0,,Yes,,,"Unkeyed-D1",,"IFD-1",,,,ciscoMib2-lan.mtf,192.124.15.2,private,,,"ethernet-csmacd",UDID2,1,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 5,"Unkeyed-D1C3",0,,Yes,,,"Unkeyed-D1",,"IFD-3",,,,ciscoMib2-lan.mtf,192.124.15.2,private,,,"ethernet-csmacd",UDID2,2,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, DE DS,,Parents, 0,0,0 1,1,0 2,2,0 3,3,3 4,4,3 5,5,3 DE DS,,Operations, merge,Elements,, DE //File indexShiftUp.cfg.new.dci IR,2,use:testing,0,03/06/2002,14:53:56, FT,GlobalInfo,symbol,symbol,string,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol FN,GlobalInfo,nmsSrc,host,addrPattern,addrFile,exclFile,community,findMib2,createNew,mergeSource,commit,modes,deletePaths 
FT,Elements,symbol,symbol,symbol,string,symbol,string,string,string,string,string,double,double,symbol,symbol,ipAddr,symbol,symbol,symbol,string,symbol,symbol,symbol,symbol,symbol,string,symbol,ipAddr,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol,symbol,double,double,string,symbol,string,string,symbol,symbol,string,string,string,symbol,symbol,symbol,ipAddr,symbol,string,integer,string,string,integer,integer,string,integer,symbol FN,Elements,objId,name,dbId,nmsId,poll,sysContact,sysDescription,sysName,sysLocation,ifDescription,speedIn,speedOut,cktEndpoint,mibTranslationFile,ipAddr,readCommunity,writeCommunity,storeInDb,ifType,uniqueDevId,index1,index2,index3,index4,possibleLatencySources,latencySource,latencyPartner,nmsSource,pollRate,alias,isNameUserSupplied,discoverMtf,nmsName,nmsState,dataSourceCapabilities,protocolCfgSymbol,deviceSpeedIn,deviceSpeedOut,copyOf,enterpriseId,processGroupName,processName,aggregateAvailability,argsRequired,processArgs,sideA,sideZ,timeZone,monitorLiveEx,ifPhysAddress,ifIpAddress,appType,appKey,responseLimit,discoverKey,caption,fullDuplex,incInLwRpts,userString,machineId,clientAccess FT,ElementGroups,symbol,symbol,symbol FN,ElementGroups,objId,groupName,elementObjId FT,Groups,symbol,symbol,symbol FN,Groups,groupId,name,groupType FT,GroupContents,symbol,symbol,symbol FN,GroupContents,rowId,groupId,elementId FT,GroupLists,symbol,symbol,symbol FN,GroupLists,groupListId,name,groupType FT,GroupListContents,symbol,symbol,symbol FN,GroupListContents,rowId,groupListId,groupId FT,Parents,symbol,symbol,symbol FN,Parents,objId,elementObjId,parentObjId FT,Associations,symbol,symbol,symbol,integer FN,Associations,objId,originElemObjId,destElemObjId,assocType FT,DiscoverAddrs,ipAddr FN,DiscoverAddrs,addr FT,DiscoverPorts,integer FN,DiscoverPorts,port FT,Operations,symbol,symbol,symbol,symbol FN,Operations,operator,section,objId1,objId2 DS,,Elements, 0,"Keyed-D1",0,,Yes,,,"Keyed-D1",,"Core Router 
Stats",,,,cisco-rh-rtr.mtf,192.124.15.1,private,,,"Router",UDID1,0,"",,,,,,NH:Discover,,,,cisco-rh-rtr.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 1,"Keyed-D1C1",0,"Keyed-D1 enet-port IFD-1",Yes,,,"Keyed-D1",,"IFD-1",,,,ciscoMib2-lan.mtf,192.124.15.1,private,,,"ethernet-csmacd",UDID1,1,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 2,"Keyed-D1C2",0,"Keyed-D1 enet-port IFD-2",Yes,,,"Keyed-D1",,"IFD-2",,,,ciscoMib2-lan.mtf,192.124.15.1,private,,,"ethernet-csmacd",UDID1,2,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 3,"Keyed-D1C3",0,"Keyed-D1 enet-port IFD-3",Yes,,,"Keyed-D1",,"IFD-3",,,,ciscoMib2-lan.mtf,192.124.15.1,private,,,"ethernet-csmacd",UDID1,3,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 4,"Unkeyed-D1",0,,Yes,,,"Unkeyed-D1",,"Core Router Stats",,,,cisco-rh-rtr.mtf,192.124.15.2,private,,,"Router",UDID2,0,"",,,,,,NH:Discover,,,,cisco-rh-rtr.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 5,"Unkeyed-D1C1",0,,Yes,,,"Unkeyed-D1",,"IFD-1",,,,ciscoMib2-lan.mtf,192.124.15.2,private,,,"ethernet-csmacd",UDID2,1,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 6,"Unkeyed-D1C2",0,,Yes,,,"Unkeyed-D1",,"IFD-2",,,,ciscoMib2-lan.mtf,192.124.15.2,private,,,"ethernet-csmacd",UDID2,2,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, 7,"Unkeyed-D1C3",0,,Yes,,,"Unkeyed-D1",,"IFD-3",,,,ciscoMib2-lan.mtf,192.124.15.2,private,,,"ethernet-csmacd",UDID2,3,"",,,,,,NH:Discover,,,,ciscoMib2-lan.mtf,,,,,,,,,,,,,,,,,,,,,,,,,,,,, DE DS,,Parents, 0,0,0 1,1,0 2,2,0 3,3,0 4,4,4 5,5,4 6,6,4 7,7,4 DE DS,,Operations, merge,Elements,, DE //Other observations Running the indexShiftUp regression test consistently produces an assertion in CfgServer::setDbIds (line 2793) on Solaris (EH 5.5). The assertion occurs because the poller cfg has elements not present in the DB. However, running the same DCI files (following the test steps) on my 5.0 NT installation (Ingres) doesn't produce the assertion. 
While debugging the cfgServer, Santosh and I observed that the number of entries in _dbTrans->_editList changed inexplicably. The test sequence invokes the following: delete all elements (clean db/poller cfg), import 6 elements and import 8 elements. The last import has 2 new elements (the other six match the originals). We witnessed _editList entries going from 0 to 6 to 8 during the merge and back to 6 in setDbIds(). Here are some interesting findings... The execTransCb() in cdtDbElemTrans.C calls clearTrans() which clears _editList (and _opList). The _editList is repopulated by initTrans(). I would also look at getEdlCb() in cdtDbElemTrans.C which manipulates the edit list (_dbTrans->_editList). It seems that the execTrans is failing to commit the DB changes (hence, _editList only gets refilled with the original elements). We probably need to debug the dbServer and consult with Robin (or someone else from the DB group). Thursday, March 07, 2002 5:29:06 PM rtrei Matt, Santosh-- Just an update: the reproducible case Brett discovered last night does not seem to be reproducible on my 5.5 system. Brett will be reinstalling his 5.5 ehealth and seeing if he can still reproduce it. Meanwhile, Brett was unable to reproduce the situation on his 5.0 NT box anyway. I am in the process of installing a 5.0 sun system to try and reproduce it there. I will let you know results tomorrow, but it is starting to look like we are back to square 1. 3/11/2002 4:29:37 PM rtrei A new nhiDbServer was created which disabled the element cache. It was created on sun 2.7 against the 5.0.2 patch 2 stream (which is what the customer has). Santosh reviewed the code. I have tested it by running the dci scripts listed above and also by doing a discover and letting it poll and run for several hours. Tony P will make a copy of this available to the customer. This is a test rather than a true one-off. 
If this works for the customer we will know to implement a more time-sensitive cache checking algorithm. (Which will take a fair amount of time.) 3/18/2002 8:24:42 AM apier We put the nhiDbServer in place on 3/11/2002. Everything ran fine until the weekend. Servers crashed again - catastrophically! Between 3:13:15 AM on 3/16/2002 and 7:33:18 on 3/17/2002 the servers bounced up and down every few minutes. (system.messages, system.messages.bak, system.messages.bak.bak) An excerpt of this behavior is below. No corresponding entries in the ingres errlog.log file. One thing to note is that periodically the Fetch DB job kicks off during the startup procedure but never finishes. Due to this we ended up with a set of duplicate elements for the remotely polled elements. One strange thing is that the nxt_hndl for elements climbed into the 5 million range. We will have to reset this once we get the duplicates removed. =============================================================================================== Saturday, March 16, 2002 03:13:15 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:13:24 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:13:24 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:16:47 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:17:05 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:17:09 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:17:12 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:20:10 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). 
Saturday, March 16, 2002 03:20:29 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:20:33 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:20:36 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:23:31 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:23:50 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:23:54 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:23:56 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:26:54 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:27:12 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:27:17 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:27:19 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:30:13 AM Pgm nhiMsgServer: Starting job 'Fetch Database' . . . (Job id: 1000007, Process id: 15398). Saturday, March 16, 2002 03:30:15 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:30:35 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:30:40 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:30:46 AM System Event nhiConsole Console initialization complete. 
Saturday, March 16, 2002 03:33:42 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:34:01 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:34:04 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:34:07 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:37:07 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:37:27 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:37:32 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:37:34 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:40:34 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:40:53 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:40:56 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:41:00 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. Saturday, March 16, 2002 03:44:09 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). Saturday, March 16, 2002 03:44:27 AM System Event nhiCfgServer Server started successfully. Saturday, March 16, 2002 03:44:31 AM System Event nhiConsole Console initialization complete. Saturday, March 16, 2002 03:44:34 AM Host pedro: Pgm nhiArControl: Controller has started. Product version is 5.0.2.0.1008.. 
Saturday, March 16, 2002 03:47:43 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). =============================================================================================== 3/18/2002 10:23:26 AM wburke Attached ticket # 61631, was temporarily resolved by loading a saved db. Customer will refrain from new Discovery or Config changes until the bug is fixed. 3/29/2002 3:31:09 PM apier Associated call ticket # 61330 closed as the customer moved away from using DP. De-escalating 4/24/2002 7:13:20 AM hbui Robin put in the fix for the deadlock problem when fetching database. We will put this fix in patch3. 5/10/2002 5:39:16 PM hbui Fix is put in patch3 5/13/2002 10:19:14 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Monday, May 13, 2002 10:17 AM To: Bui, Ha Subject: PT 21512 nhiCfgServer crashes during CWM import Ha, A customer is seeing the error below: Friday, May 10, 2002 01:34:07 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). This was attached to bug 21512, but the customer is not importing. Should this be associated with this? If so, is this fixed in P03? Thanks, Mike 5/22/2002 10:11:14 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Wednesday, May 22, 2002 10:10 AM To: Bui, Ha Subject: FW: PT 21512 nhiCfgServer crashes during CWM import Ha, You never replied to this. This customer is getting upset so I need to know if this should be logged as a new bug because they are not importing or if P03 will fix this. Thanks, Mike -----Original Message----- From: McNally, Mike Sent: Monday, May 13, 2002 10:17 AM To: Bui, Ha Subject: PT 21512 nhiCfgServer crashes during CWM import Ha, A customer is seeing the error below: Friday, May 10, 2002 01:34:07 AM Internal Error nhiCfgServer Pgm nhiCfgServer: Assertion for 'elemPtr' failed, exiting (in file ../CfgServer.C, line 2752). 
This was attached to bug 21512, but the customer is not importing. Should this be associated with this? If so, is this fixed in P03? Thanks, Mike 5/23/2002 10:15:57 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, May 23, 2002 10:15 AM To: Bui, Ha Subject: RE: PT 21512 nhiCfgServer crashes during CWM import Ha, Here is the output of that command, looks like everything matches. INGRES TERMINAL MONITOR Copyright (c) 1981, 1998 Computer Associates Intl, Inc. Ingres SPARC SOLARIS Version II 2.0/9808 (su4.us5/00) login Wed May 22 20:23:22 2002 continue * Executing . . . +-------------+ |element_id | +-------------+ +-------------+ (0 rows) continue * Your SQL statement(s) have been committed. Ingres Version II 2.0/9808 (su4.us5/00) logout Wed May 22 20:23:54 2002 Thanks, Mike 5/30/2002 1:29:29 PM hbui Since Mike opened another ticket for error elemPtr and I already put in the fix for the original problem in patch 3, I'll mark this as fixed. 3/11/2002 11:51:48 AM dwaterson 5.0.2 P1, D1 Solaris 2.7 Issue: The eHealth - DB became inconsistent. This happened twice at exactly the same time (per ICS) In the errlog.log you see the following messages: - no more logical locks (was meanwhile increased from 700 --> 1400) was not critical, then DB continued to run - deadlock encountered locking table health.nh_import_poll_info probably the beginning of all problems - Error updating iirelation system table on database ehealth - DB has become inconsistent SEE BAFS: 61200/61281 for the complete error log Excerpt from error log: 6 18:30:33 2002 E_DM9026_REL_UPDATE_ERR Error updating iirelation system table on database ehealth while executing some other operation such as modify, index, etc. Check consistency of system catalogs. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 164 for table iirelation in database ehealth with mode 5. Resource held by session [15054 84a]. 
KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM0042_DEADLOCK Resource deadlock. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM9026_REL_UPDATE_ERR Error updating iirelation system table on database ehealth while executing some other operation such as modify, index, etc. Check consistency of system catalogs. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 164 for table iirelation in database ehealth with mode 5. Resource held by session [15054 84a]. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM0042_DEADLOCK Resource deadlock. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM9026_REL_UPDATE_ERR Error updating iirelation system table on database ehealth while executing some other operation such as modify, index, etc. Check consistency of system catalogs. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 164 for table iirelation in database ehealth with mode 5. Resource held by session [15054 84a]. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM0042_DEADLOCK Resource deadlock. KKAAH027::[38997 , 0000084b]: Wed Mar 6 18:30:33 2002 E_DM9026_REL_UPDATE_ERR Error updating iirelation system table on database ehealth while executing some other operation such as modify, index, etc. Check consistency of system catalogs. ::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM9042_PAGE_DEADLOCK Deadlock encountered locking page 164 for table iirelation in database ehealth with mode 5. Resource held by session [15054 84a]. ::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM0042_DEADLOCK Resource deadlock. ::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM960D_DMVE_PUT Error recovering PUT operation. ::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM9638_DMVE_REDO An error occurred during REDO recovery. ::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM9439_APPLY_REDO Error applying REDO operation. 
::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM943D_RCP_DBREDO_ERROR Recovery error on Database ehealth. Error occurred applying Redo recovery for log record with LSN (1015230824,81268984). Recovery will be halted on this database while the RCP attempts to successfully recover other open databases. ::[II_RCP , 00000005]: Wed Mar 6 18:30:34 2002 E_DM943B_RCP_DBINCONSISTENT Database (ehealth, health) being marked inconsistent by the recovery process. The database could not be successfully restored following a system, process, or transaction failure. The database should be restored from a previous checkpoint. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM9331_DM2T_TBL_IDX_MISMATCH Base table missing or indices missing. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM9C8B_DM2T_TBL_INFO An error occurred while attempting to build the Table Control Block for table (3038,0) in database ehealth. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM9C89_DM2T_BUILD_TCB An error occurred while building a Table Control Block for a table. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM9C8A_DM2T_FIX_TCB An error occurred while trying to locate and/or build the Table Control Block for a table. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM0166_AGG_ADE_FAILED Execution of ADE control block in DMF Aggregate Processor failed. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_AD2103_ALLOCATED_FCN_ERR The callback function for the iitotal_allocated_pages function returned an error check the DBMS error log for more information KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_SC0216_QEF_ERROR Error returned by QEF. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_SC0206_CANNOT_PROCESS An internal error prevents further processing of this query. 
Associated error messages which provide more detailed information about the problem can be found in the error log, II_CONFIG:errlog.log KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_CL0F0F_LG_DB_INCONSISTENT One of the %s databases has become inconsistent. No new transactions may be begun against the inconsistent database, and no new update operations against the inconsistent database may be performed by transactions already underway. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM900C_BAD_LOG_BEGIN Error trying to begin a transaction on the database 004E0083. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_DM9500_DMXE_BEGIN Error occurred beginning a transaction. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_QE0025_USER_ERROR A user error has occurred. KKAAH027::[38997 , 0000084a]: Wed Mar 6 18:30:34 2002 E_CL0F0F_LG_DB_INCONSISTENT One of the %s databases has become inconsistent. No new transactions may be begun against the inconsistent database, and no new update operations against the inconsistent database may be performed by transactions already underway. 3/18/2002 2:47:56 PM mmcnally -----Original Message----- From: McNally, Mike Sent: Monday, March 18, 2002 2:37 PM To: Trei, Robin Subject: PT 21601 Inconsistent database caused by iirelation table deadlocks Robin, Do you have an update on this one? Thanks, Mike 3/19/2002 1:08:01 PM mwickham -----Original Message----- From: Gray, Don Sent: Tuesday, March 19, 2002 11:58 AM To: Wickham, Mark Subject: FW: FW: 61281 - RE: ICSREQ003891 Database inconsistency and failure whenDB reload Mark, Till needs action on this and the bug should be escalated. Don 3/21/2002 11:52:13 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, March 21, 2002 11:41 AM To: Trei, Robin Subject: PT 21601 Inconsistent database caused by iirelation table deadlocks Hi Robin, Sorry to bug you on this one. I know you're very busy. 
Do you think we will be able to update this customer by the end of today? They are constantly asking for an update. Thanks, Mike 3/21/2002 1:44:51 PM mmcnally -----Original Message----- From: Trei, Robin Sent: Thursday, March 21, 2002 11:58 AM To: McNally, Mike Subject: RE: PT 21601 Inconsistent database caused by iirelation table deadlocks No problem, it is an escalated ticket. My apologies for not getting back to you sooner. Normally I would have responded instantly for an issue like this. I would like you to do an nhCollectCustData on this and send me the results. I would also like as much of a history as the customer can give as to when, what sequence this happened. Where they first saw this, etc. Anything they can think of that might be helpful. If this is already here, just point me to it. This is one of my top priorities for the next few days. Have you gotten the customer back up and running? If not, do you need help doing that? -----Original Message----- From: Trei, Robin Sent: Thursday, March 21, 2002 12:34 PM To: McNally, Mike Subject: RE: PT 21601 Inconsistent database caused by iirelation table deadlocks One other question I had-- is this customer using distributed polling (aka remote polling) at this site? Thursday, March 21, 2002 1:44:06 PM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, March 21, 2002 1:33 PM To: Trei, Robin Subject: RE: PT 21601 Inconsistent database caused by iirelation table deadlocks Robin, I have requested the nhCollectCustData from the customer and asked if they are using dist.polling. I will update you when I hear back from them. They are currently up and running. This has happened to them a couple times since March 8th. So they want to know how to prevent this from happening again, hence the escalated bug..... Below is the history of how this came about. The eHealth - DB became inconsistent. 
In the errlog.log you see the following messages:
- no more logical locks (was meanwhile increased from 700 --> 1400); this was not critical, and the DB continued to run
- deadlock encountered locking table health.nh_import_poll_info; probably the beginning of all problems
- Error updating iirelation system table on database ehealth; DB has become inconsistent
Please see the errlog.log in a separate mail. The following actions were invoked:
- nhForceDb
- nhSaveDb (in the save.log you see the incorrect table)
- drop table (with verifydb -mrun sdbname "ehealth" -odrop_table )
- nhSaveDb
- destroyDb
- createDb
- nhLoadDb (here there were different failures, but a load eventually was successful)
The logs they sent in originally when opening the call ticket are on BAFS/61000/61281/logs. They include the save.log, load.log, and errlog.log. As soon as I receive the nhCollectCustData I will place it in the same location and notify you. Thanks, Mike

3/22/2002 10:54:15 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Friday, March 22, 2002 10:43 AM To: Trei, Robin Subject: PT 21601 Inconsistent database caused by iirelation table deadlocks Robin, I have just received the requested log files and placed them on BAFS/61000/61281. The customer says they are not using distributed polling but they are importing from Strataview. Let me know if you need anything else. Thanks, Mike

3/28/2002 9:47:39 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Thursday, March 28, 2002 9:36 AM To: 'support@ics.de' Subject: ICSREQ003891 - CCRD 61281 "Database inconsistency and failure when DB reload" Support, This is in regard to call ticket ICSREQ003891 - CCRD 61281 "Database inconsistency and failure when DB reload". Please have the customer run the following command from $NH_HOME and forward the resulting statDump.out file: statdump -zc $NH_RDBMS_NAME > statDump.out Regards, Mike Requested the above from customer.
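The rebuild sequence the customer ran can be sketched as a single script. This is a dry-run sketch only: the nh* utilities are the eHealth CLIs named in the ticket, the verifydb flags and the destroyDb/createDb command names are reproduced exactly as quoted and may differ on a real install, and nothing is executed until DRY_RUN is explicitly cleared.

```shell
#!/bin/sh
# Dry-run sketch of the DB rebuild sequence reported in the ticket.
# Set DRY_RUN=0 only on a real eHealth system, after a verified backup.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"   # just print the step
  else
    "$@"                   # execute for real
  fi
}

run nhForceDb                                       # force the inconsistent DB open
run nhSaveDb                                        # first save; check save.log for the bad table
run verifydb -mrun sdbname "ehealth" -odrop_table   # drop the damaged table (flags as quoted)
run nhSaveDb                                        # save again without the bad table
run destroyDb                                       # command names as written in the ticket
run createDb
run nhLoadDb                                        # reload; individual loads may need retries
```

In dry-run mode the script only prints each step, which makes it safe to review the exact command sequence with the customer before running anything.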
3/29/2002 10:52:17 AM dbrooks change to more info per escalated ticket meeting 3/29.

4/3/2002 11:35:41 AM tbailey customer went back to 4.7.1 on their production machine. Unless we can duplicate this in house, or ICS can duplicate this on a test machine, I don't see that we'll be able to get the necessary debug.

4/8/2002 2:13:47 PM foconnor Ultimatum that this is fixed by April 15. Martin called in and a gentleman named Frank is on the line. Test system is running 5.0.1; production system is running 4.7.1. Database is inconsistent on both 4.7.1 and 5.0.2. The 4.7.1 system is down now.

4/8/2002 3:06:02 PM foconnor Both systems crash every few days. nhDbStatus has been run very frequently; I told them not to use it, and that issue is being worked on. Their questions: Patch 3 had no problems; were there any Ingres changes between Patch 3 and Patch 8 of 4.7.1? Why does the iiprtqry.log get overwritten?

4/9/2002 9:26:12 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Tuesday, April 09, 2002 9:07 AM To: 'support@ics.de' Cc: O'Connor, Farrell Subject: Call ticket 61281: ICSREQ003891 ICS, The problem with nhDbStatus is scheduled to be fixed in Patch 3 of eHealth 5.0.2. There will be no new patches issued for 4.7.1. There were no Ingres changes between Patch 3 and Patch 8 of eHealth 4.7.1, but there may have been in our code. Can you send me the iiprtqry.log file that you have? It can be useful for Computer Associates.

4/9/2002 11:38:37 AM tbailey Can we get a one-off for this fix?

4/9/2002 11:59:56 AM rhawkes Yulun, Robin has done work for this, so please coordinate with her. Thanks. -- Rich

4/10/2002 12:29:12 PM yzhang This is the iiprtqry.log from the customer's run of nhDbStatus. It looks like this log does not come from nhDbStatus, because when I ran it on my system, nhDbStatus only accessed iitables, iimulti_location, and iilocation_info.
My recommendation is to have the customer rerun nhDbStatus with II_EMBED_SET; the other recommendation is to upgrade to 502 p02 (where we have the fix for the logical lock error and the deadlock, which may cause inconsistency). The customer is currently on 502p1. Can you comment?

4/10/2002 7:14:48 PM yzhang preparing oneoff

4/11/2002 1:52:29 PM yzhang please go to ~yzhang/remedy/21601 (on sulfur), and follow the instructions in ~yzhang/remedy/21601/instruction.txt for this oneoff. All libraries and executables have been tested.

4/11/2002 2:03:56 PM yzhang Farrell, I tarred everything under ~yzhang/remedy/21601 into 21601.tar; this tar is also located in ~yzhang/remedy/21601. What you need to do is place this tar file on the ftp site and have the customer download it. It is best for you to read the instruction.txt so that you can work with the customer on this.

4/16/2002 1:48:28 PM mwickham Customer is downloading the 21601.tar.Z file. We are waiting for their feedback.

4/22/2002 12:13:15 PM wburke -----Original Message----- From: Burke, Walter Sent: Monday, April 22, 2002 12:02 PM To: 'support@ics.de' Subject: Ticket # 61281 - nhDbStatus Failure Support, I have been re-assigned this ticket. Please inform as to the status of the following one-off: There is a file called 21601.tar on our ftp site: ftp.concord.com/outgoing/21601.tar which was made for eHealth 5.0.2 Patch 1 or 2 and which has a new nhDbStatus executable and the appropriate library files. Sincerely,

4/24/2002 11:28:53 AM yzhang We sent the oneoff to the customer about a week ago, and have not heard anything from them yet. I think if we cannot get anything back today, then we need to de-escalate this one. Yulun

4/24/2002 11:32:00 AM wburke -----Original Message----- From: Burke, Walter Sent: Wednesday, April 24, 2002 11:20 AM To: 'support@ics.de' Subject: Ticket # 61281 - nhDbStatus Failure Support, I have been re-assigned this ticket.
Please inform as to the status of the following one-off: There is a file called 21601.tar on our ftp site: ftp.concord.com/outgoing/21601.tar which was made for eHealth 5.0.2 Patch 1 or 2 and which has a new nhDbStatus executable and the appropriate library files.

4/25/2002 11:55:39 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, April 25, 2002 11:44 AM To: 'support@ics.de' Subject: Ticket # 61281 - nhDbStatus Failure Support, I have been re-assigned this ticket. Please inform as to the status of the following one-off: There is a file called 21601.tar on our ftp site: ftp.concord.com/outgoing/21601.tar which was made for eHealth 5.0.2 Patch 1 or 2 and which has a new nhDbStatus executable and the appropriate library files. Sincerely,

4/25/2002 11:57:50 AM wburke we cannot de-escalate as this needs to be patched.

5/6/2002 5:20:30 PM rtrei Reassigning to me. This is a repeat of another ticket I am currently holding regarding inconsistencies.

5/7/2002 11:37:42 AM wburke -----Original Message----- From: ICS Product Support [mailto:support@ics.de] Sent: Tuesday, May 07, 2002 11:20 AM To: Burke, Walter Cc: t.froehlich@concord.com; jahn@ics.de Subject: Re: Ticket # 61281 - nhDbStatus Failure ICSREQ003891 Hey Walter, > > Please inform as to the status of the following one-off: > > There is a file called 21601.tar on our ftp site: > ftp.concord.com/outgoing/21601.tar which was made for eHealth 5.0.2 Patch 1 > or 2 > which has a new nhDbStatus executable and the appropriate library files. > Thank you for the patch. Due to the severity of this call and the customer, we tested the new nhDbStatus in our environment at ICS. On the installation where we reproduced the error, our tests show that the error does not occur anymore; that means we cannot crash our DB or make it inconsistent. That is a good sign !!! With the customer we will decide when we may install the patch in his productive environment. As soon as the customer agrees we will close this call.
thank you again for your one-off patch !! Please remember to include the patch in the next official P-release. best regards,

5/10/2002 10:25:01 AM wburke This has been successfully tested. Let's go to tribunal.

5/17/2002 11:38:24 AM rtrei code checked in to patch 3

5/17/2002 4:02:01 PM rsanginario ran Db status menu and nhiDbStatus (command line). Saw nothing funny in $NH_HOME/idb/ingres/files. (Robin is also running a test on the longevity machine.)

3/11/2002 5:24:34 PM mpoller From Trend.1000080.log: Error: Append to table SESSION.nht_cdb_elem_speed5300 failed, see the Ingres error log file for more information (E_CO0005 COPY: can't open file 'C:/nethealth/tmp//cdbElemInfo1014912650'. ). Report failed.
From Trend.1000083.log: Error: Append to table SESSION.nht_cdb_elem_speed3630 failed, see the Ingres error log file for more information (E_CO0005 COPY: can't open file 'C:/nethealth/tmp//cdbElemInfo1014912650'. ). Report failed.
From Trend.1000106.log, Trend.1000113.log, Trend.1000105.log, Trend.1000087.log, and Trend.1000110.log: Fatal Error: Assertion for 'typeMap' failed, exiting (Every elementId should have a corresponding map in file ./CdbTblElemStats.C, line 1544). Report failed.
From Trend.1000100.log: Fatal Error: Assertion for 'n >= 0 && d' failed, exiting (in file ../GrfData.C, line 2536). Report failed.
The same 'n >= 0 && d' assertion failure appears in Trend.1000111.log, Trend.1000101.log, Trend.1000102.log, Trend.1000108.log, Trend.1000098.log, Trend.1000096.log, Trend.1000099.log, Trend.1000104.log, Trend.1000114.log, and Trend.1000103.log: Fatal Error: Assertion for 'n >= 0 && d' failed, exiting (in file ../GrfData.C, line 2536). Report failed.
All logs containing these messages can be found on BAFS:/EscalatedTickets/61000/61044.

3/19/2002 5:29:39 PM kmoylan Seems like a DB issue to be investigated.

3/20/2002 9:43:22 AM rtrei Re-assigning to Rich to give to the db team. Yes, Rich, my guess is that this is db-related. I would have someone pull the errlog.log, get a list of tables, check disk space. Look for space or permissions problems. They may need to work with a reports person in some areas, but this should not break unless the underlying db was in trouble.
3/22/2002 2:17:39 PM yzhang Can you do the following: 1) there are a lot of warning messages from dataAnalysis; you need to compare the group information under $NH_HOME/reports to the group information on the console. Remove any groups and group lists from $NH_HOME/reports which do not exist on the console. 2) have the customer run a trend report in the advanced debug mode, and send the debug output file. If you don't know how to do this, check with another support engineer. Thanks Yulun

3/27/2002 4:51:36 PM mmcnally I have the customer's database loaded in our lab. I checked the reports dir. for anything that didn't match the console and only found one problem. Fixed that, went to run a scheduled report, and noticed the variable is missing. Placed a variable in and everything ran fine. All scheduled jobs are missing the variables. After the report runs and you click modify job, the variable is blank again. Yulun said to reassign to Dave.

3/27/2002 5:40:40 PM rtrei Rich-- I'm reassigning this back to the db team. Please bring it to the database team meeting so we can discuss what it says, and what info we need to go after. It will be a learning opportunity, which we all need :>

4/2/2002 11:30:57 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Tuesday, April 02, 2002 11:20 AM To: Hawkes, Richard Subject: PT 21625 Scheduled jobs are missing all variables Richard, Do you need any info on this one? This customer's database is still set up in our lab. Thanks, Mike

4/4/2002 9:19:20 AM cpaschal Hi Richard, The call ticket associated with problem ticket 21625 has been reassigned to me for work. I realize you were off-site yesterday and were not able to work on this issue. However, the customer is getting quite upset with us. A review of the ticket shows that we have the customer's db and reports directory. Is there anything that I can request from the customer that would help speed things along?
Thank you for your assistance, Chris

4/4/2002 10:10:36 AM yzhang Colin, This is the problem you showed in the lab; it was reassigned back to me. Can you: 1) keep the system, I will look at it this morning, 2) run a trend report in debug mode on your system and send the debug output, 3) send me the errlog.log from your system. Robin, do you have something to say about this one? Thanks Yulun

4/4/2002 10:30:48 AM yzhang Rob Lindberg thinks this one should go to Rob Weatherford, because the same problem has been fixed in 5.0

4/5/2002 2:49:59 PM bweatherford I need a little more info on this ticket: 1) what actual version are they running? 2) Are they willing to install the latest patch? 3) is the problem with new scheduled reports? 4) both group and element trends? 5) Could support upgrade the machine in the support lab to the latest patch and see if the problem is still there? I am leaving this ticket in moreInfo :)

4/11/2002 7:12:51 AM cestep I performed the upgrade in the lab. After the upgrade, I did the following:
1. Opened Schedule Jobs window; Trend reports still don't have variables.
2. Scheduled a new Trend report to see if it would hold for ones added after the upgrade.
3. After scheduling the report, I kept the schedule jobs window open and double-clicked the job I just scheduled.
4. No variables...
5. Clicked OK to close the schedule jobs window.
6. Opened the scheduled jobs window again.
7. Variables are there for my report, as well as all others.
8. Deleted my report.
9. Clicked OK to close the scheduler.
10. Opened the scheduler again; all variables still there.
11. Repeated step 10 a few times to make sure it stayed, and it did.
This problem seems to be resolved by upgrading to 5.0.2. I will contact the customer with the procedure to execute.

4/16/2002 2:35:28 PM cestep Sent procedure to customer again.
4/16/2002 3:43:44 PM don de-escalated since the fix worked in-house and is already in 5.0.2

4/17/2002 10:01:31 AM cestep Reply from the customer: -----Original Message----- From: John Dobos [mailto:johndobos@sti.synovus.com] Sent: Wednesday, April 17, 2002 9:34 AM To: Support@concord.com Subject: Re: Ticket #61044 - Missing trend variables Upgrading did correct the presenting problem. It has, however, opened additional problems that I am dealing with through related problem tickets with Concord. -------------------------------------------------------- I believe this can be set to fixed.

4/17/2002 10:22:34 AM bweatherford Setting ticket to fixed; we can deal with the other tickets separately. Thanks for the help

3/13/2002 9:07:06 AM Betaprogram Alcatel USA Reinhard Pfaffinger 972 519 4943 rpfaffin@usa.alcatel.com nhConvertDb error occurred during installation of beta 5 on Solaris 2.8: . . . SVRMGR> Server Manager complete. Converting database Converting a prior version 5.5 database . . . Error: Database error: ERROR: SQLCODE=0 SQLTEXT=ORA-01237: cannot extend datafile 5 ORA-01110: data file 5: '/apps/nh_. --------------------------------------- . . . This was a standalone, migrated database. There are no errors in the console windows and the database status works. The eHealth servers and Oracle appear to be running fine. Thanks,

-------- Original Message -------- Subject: 5.5 install Date: Tue, 12 Mar 2002 17:04:21 -0600 (CST) From: "eHealth 5.5 Login" To: rein.pfaffinger@alcatel.com start installation at Tue Mar 12 16:21:27 CST 2002 Before installing eHealth, you should make sure that: 1) The account from which you will run eHealth exists You will need to supply the following information: 1) The directory eHealth will be installed in 2) The name of the user that will run eHealth ------------------------------------------------------------------------------ eHealth Location --------------------------------------- Where should eHealth be installed?
[/ehealth] There appears to be a version of eHealth in /ehealth. You can: 1) stop now. 2) update the eHealth files in /ehealth. 3) continue, without updating any eHealth files. 4) specify another directory. What is your choice? (1|2|3|4) [2] ------------------------------------------------------------------------------ Online eHealth Guides --------------------------------------- You can install online versions of the eHealth guides and make them accessible to users from the Web interface. (You must have an additional 35MB of disk space in the eHealth installation directory.) Do you want to install the online versions of the eHealth Guides? [y] The online guides will be installed in the /ehealth/web/help/doc directory. ------------------------------------------------------------------------------ eHealth User --------------------------------------- From which account will you run eHealth? [ehealth] ------------------------------------------------------------------------------ eHealth Date format --------------------------------------- eHealth can display dates in one of the following formats. 1) mm/dd/yyyy 2) dd/mm/yyyy 3) yyyy/mm/dd 4) yyyy/dd/mm What date format should eHealth use? (1|2|3|4) [1] ------------------------------------------------------------------------------ eHealth Time format --------------------------------------- eHealth can display times in one of the following formats. 1) 12 Hour clock 2) 24 Hour clock What time format should eHealth use? (1|2) [1] ------------------------------------------------------------------------------ Web Reporting Module --------------------------------------- An HTTPD Web server will be installed. Do you want this Web server to start automatically? [y] There appears to be a Web server process running already: At this point you can: 1) exit the install, stop the Web server process yourself, then restart this eHealth installation. 2) continue the install, killing the running Web server process. 
So you can: 1) quit 2) continue the install, killing the running Web server process What is your choice? (1|2) [2] What port should the Web server use? [8001] ------------------------------------------------------------------------------ Oracle Database Table Setup --------------------------------------- You will now be given the option of whether you want your Oracle database to be created and to have its initial load. Do you want the creation of the oracle database to occur? (y|n)? [n] --------------------------------------- No more questions Take a break! The install will continue for a while (30 minutes or more). Additional time may be needed for database conversion. ********************************************************************* * Interrupting this process will result in an unusable installation * * or database. Please do not attempt to interrupt this without * * first contacting Concord Customer Service. * ********************************************************************* --------------------------------------- Copy eHealth files Moving writable files aside. Copying the eHealth files to /ehealth. 0% 25% 50% 75% 100% ||||||||||||||||||||||||||||||||||||||||||||||||||| Uncompressing files... 0% 25% 50% 75% 100% ||||||||||||||||||||||||||||||||||||||||||||||||||| Starting eHealth verification checks... eHealth checksums verified successfully. The eHealth files have been successfully copied. Checking saved writable files. updating database parameters Edits were needed. Stopping and starting database Oracle will now be shutdown with the 'immediate' option. Oracle Server Manager Release 3.1.7.0.0 - Production Copyright (c) 1997, 1999, Oracle Corporation. All Rights Reserved. Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production With the Partitioning option JServer Release 8.1.7.0.0 - Production SVRMGR> Connected. SVRMGR> Database closed. Database dismounted. ORACLE instance shut down. SVRMGR> Server Manager complete. Database "NHTD" shut down. 
Oracle Server Manager Release 3.1.7.0.0 - Production Copyright (c) 1997, 1999, Oracle Corporation. All Rights Reserved. Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production With the Partitioning option JServer Release 8.1.7.0.0 - Production SVRMGR> Connected. SVRMGR> ORACLE instance started. Total System Global Area 448254112 bytes Fixed Size 73888 bytes Variable Size 133328896 bytes Database Buffers 314572800 bytes Redo Buffers 278528 bytes Database mounted. Database opened. SVRMGR> Server Manager complete. Converting database Converting a prior version 5.5 database . . . Error: Database error: ERROR: SQLCODE=0 SQLTEXT=ORA-01237: cannot extend datafile 5 ORA-01110: data file 5: '/apps/nh_. --------------------------------------- Configuring the eHealth Web Reporting Module Generating files used to create web report lists... Installing AdvantEDGE View reporting module Installing AdvantEDGE View for eHealth. --------------------------------------- Configuring product to start automatically at system startup Stopping Concord TrapEXPLODER Concord TrapEXPLODER starting. --------------------------------------- eHealth installation completed successfully. (press RETURN to continue) --------------------------------------- Please take care of the following 5 items: --------------------------------------- 1) The following file(s) have been preserved in /ehealth/changed/5.5.0/install.2: ./nethealthrc.csh ./nethealthrc.sh ./bin/ehealth ./lib/libclntsh.so ./reports/drilldown/liveExAlarmStyles.sde ./reports/drilldown/liveExAlarmStyles.sds ./reports/trafficMatrix/configurations/Applications-For-AllNodes-Text.cfg ./sys/debugLog.cfg ./modules/stratacom/nhiSvImport.fmt If you have changed any of these files, compare them with the newly installed files. Merge any changes you want to preserve into the newly installed files. If you have questions, please contact Customer Support before modifying or moving these files.
--------------------------------------- 2) Please check that the Web report access permissions for each user are correct. --------------------------------------- 3) For convenience, you may wish to add the following line to the .cshrc file of the user 'ehealth': source /ehealth/nethealthrc.csh --------------------------------------- 4) To start eHealth, use this command: /ehealth/bin/nethealth --------------------------------------- 5) Please review the ReadMe file for new recommendations on transaction log file size. Please remember the location of the ReadMe file for future reference: /ehealth/ReadMe.5.5.0 Installation complete.

3/14/2002 9:55:16 AM hbui "Bui, Ha" wrote: > > Hi Pfaffinger, > > I have a ticket submitted by you about a problem with converting the database > while upgrading ehealth. Before I go chase the bug, could you please check > for me to see if your system ran out of disk space or doesn't support > large files? I looked at the oracle error, and there is a very good chance > that your system doesn't support large files. > Thanks, > > _Ha Bui > Ha, The Solaris 2.8 64-bit system supports large files. It was one of the pre-requisites for loading 5.5 and Oracle. And yes, the /apps partition in which the eHealth database resides is at 100% capacity after the beta 4 to beta 5 upgrade. Thanks,

3/14/2002 10:31:48 AM hbui Pfaffinger, Since your system is at 100% capacity, you ran out of disk space. You need to ask your administrator for more disk space and rerun nhConvertDb. Please let me know how it turns out. Thanks, _Ha

3/14/2002 11:50:53 AM dvenuto Per triage this morning, we believe this is a benign error that occurred as a result of an upgrade between Beta versions. Everything appears to be working correctly. Once Ha confirms this, it will be marked as a NoDupl unless something further is found out.
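Since the ORA-01237 error here traced back to a full /apps partition, a quick headroom check before rerunning nhConvertDb can save a failed conversion. A minimal sketch; the NH_DATA_DIR path and the 24 GB small-configuration figure are assumptions for illustration, so substitute the real datafile partition and sizing-spreadsheet number:

```shell
#!/bin/sh
# Pre-check: is there enough free space on the datafile partition
# before running nhConvertDb? Defaults are illustrative only.
NH_DATA_DIR="${NH_DATA_DIR:-/tmp}"
NEED_KB=$((24 * 1024 * 1024))   # 24 GB expressed in KB (assumed requirement)

# POSIX df -P output: second line, fourth column is available KB
avail_kb=$(df -kP "$NH_DATA_DIR" | awk 'NR==2 {print $4}')

if [ "$avail_kb" -lt "$NEED_KB" ]; then
  echo "insufficient: ${avail_kb} KB free on $NH_DATA_DIR, want ${NEED_KB} KB"
else
  echo "ok: ${avail_kb} KB free on $NH_DATA_DIR"
fi
```

Running this before the conversion turns the mid-upgrade "cannot extend datafile" failure into an up-front, recoverable warning.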
3/15/2002 11:43:31 AM hbui Asked Pfaffinger to move the datafile for nh_index to a different disk drive and create a link for the file to see if this will start ehealth. Waiting for reply

3/18/2002 9:01:37 AM dvenuto waiting on moreinfo

3/18/2002 2:15:11 PM hbui Email from Reinhard Pfaffinger: I just received the new Oracle disk space requirements. There is no point in continuing with the beta here at Alcatel because I am about 15GB worth of disk space too small to even run a small configuration (standalone, small configuration requires at least 24GB of disk space).

3/19/2002 11:37:47 AM hbui Marked as nobug since the user's system didn't meet the requirements

3/20/2002 8:31:00 AM rhawkes The user was able to find the disk space, and is now running. Please confirm that the underlying issue in this ticket is resolved. Thanks.

3/20/2002 8:48:55 AM dvenuto Should be in moreinfo.

3/20/2002 1:41:13 PM Betaprogram Hi Reinhard, Can you confirm that the underlying issue in this ticket is now resolved as well? Thanks!

3/20/2002 3:02:01 PM Betaprogram yes, the nhConvertDb worked without errors during the beta 5 re-install.

3/20/2002 3:38:10 PM hbui marked as nobug

3/13/2002 11:25:19 AM mpoller Installation of CA ARCserve backup software caused a licensing failure for Ingres. Installed ARCserve, which is a CA product. After it was installed and they rebooted (Windows machine), Ingres would no longer start. This is due to licensing problems with multiple CA products on one machine. After Ingres would not start, they removed ARCserve. The licensing files might have been there if they had not removed ARCserve, but they did, and the files are gone. Found Primus solution TS3103. In that case, the cause and resolution were: Upon installation, Ingres sets a registry key that points to its license files, which are required for Ingres to start.
The key is the following: HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\License Within the registry key, you have "InstallPath", which should point to "C:\CA_LIC". This is changed to "C:\ARCSERVE\CA_LIC" when ARCserve is installed. We must change it back to "C:\CA_LIC" for Ingres to start successfully. They changed it back to C:\CA_LIC and it worked. In this case the registry key was not changed. Seeing the following error in the NT event logs: E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product.

What appears to have occurred is that the ARCserve software uses the same licensing files as Ingres. It ended up overwriting the lic98.dat as well as other necessary licensing files for Ingres. The following was done to resolve the issue:
- remove all of ARCserve from the machine
- delete all files from the CA_LIC directory
- reboot
- sent the customer (customer and I are both running 5.02 on NT) a copy of all files from my own CA_LIC directory
- reboot to ensure the files are reread and to see if Ingres starts automatically as it should
Everything came up fine.

Errors from logs and files: All logs and files can be found on BAFS:\EscalatedTickets\61000\61350. From the errlog.log since install of the ARCserve application:
CONCORD ::[II\INGRES\105 , ffffffff]: Sat Mar 09 23:03:44 2002 E_SC0129_SERVER_UP Ingres Release II 2.0/9808 (int.wnt/00) Server -- Normal Startup.
E_CL2659_CI_TERMINATE 2H30 LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product.
LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
CONCORD ::[ , 00000000]: Sat Mar 09 23:11:06 2002 E_GC0151_GCN_STARTUP Name Server normal startup.
E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
CONCORD ::[ , 00000000]: Sat Mar 09 23:47:13 2002 E_GC0151_GCN_STARTUP Name Server normal startup.
E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
CONCORD ::[ , 00000000]: Sun Mar 10 00:03:57 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation.
CONCORD ::[ , 00000000]: Sun Mar 10 00:04:03 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation.
[The E_GC0139_GCN_NO_DBMS message then repeats identically every second or two in the errlog.log through Sun Mar 10 00:04:30 2002.]
CONCORD ::[ , 00000000]: Sun Mar 10 00:04:31 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:31 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:31 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:32 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:32 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:32 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[< , 00000000]: Sun Mar 10 00:04:32 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:33 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:33 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:33 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:33 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:34 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. 
CONCORD ::[ , 00000000]: Sun Mar 10 00:04:34 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:34 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:35 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:36 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 00:04:37 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Sun Mar 10 09:49:19 2002 E_GC0151_GCN_STARTUP Name Server normal startup. E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0 CONCORD ::[ , 00000000]: Mon Mar 11 06:08:51 2002 E_GC0151_GCN_STARTUP Name Server normal startup. E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0 CONCORD ::[ , 00000000]: Mon Mar 11 07:04:59 2002 E_GC0139_GCN_NO_DBMS No DBMS servers (for the specified database) are running in the target installation. CONCORD ::[ , 00000000]: Mon Mar 11 07:24:34 2002 E_GC0152_GCN_SHUTDOWN Name Server normal shutdown. CONCORD ::[ , 00000000]: Mon Mar 11 07:25:20 2002 E_GC0151_GCN_STARTUP Name Server normal startup. E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. 
LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0
CONCORD ::[ , 00000000]: Mon Mar 11 07:31:37 2002 E_GC0151_GCN_STARTUP Name Server normal startup. E_CL2659_CI_TERMINATE Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0

From the NT Application Logs:
Error 1: The description for Event ID ( 200 ) in Source ( Ingres_Database ) could not be found. It contains the following insertion string(s): Ingres error: 0 0, Failed to start Ingres installation. Please refer to the error log for further information.
Error 2: 'Computer Associates Licensing -2H30 - License Failure. Terminating... Please run the appropriate license program to properly license your product. LRF=2H30, 0001026f7404, DESKTOP, CONCORD, 0'

From the NT System logs: The eHealth service depends on the Ingres Intelligent Database service, which failed to start because of the following error: The operation completed successfully.

The reason for this problem ticket being logged is two-fold: 1 - informational, so that if other people begin to see this very often, they have information on what causes this issue and how to resolve it; 2 - for engineering to possibly create something that will prevent this from occurring in the future.

4/5/2002 3:42:05 PM rhawkes New releases of our products do not support Ingres. Since we have not seen this problem pervasively, we do not expect to fix it. Support should maintain the workaround information needed to resolve related customer calls.
3/13/2002 11:49:29 AM Betaprogram STATE FARM: Submitted by Mike Loewenthal, BETA AE - Craig Cies craig.cies.h2o6@statefarm.com 309-763-4561. After licensing a 5.5 Beta 4 with a good 5.5 license (we've been using a 5.0 for a while to do polling) at the end of the day 12-Mar-2002, we tried joining the machine to a cluster and got a bunch of "Internal Error nhiPoller[Net] Pgm nhiPoller[Net]: Expectation for 'Bad' failed (Element not found in Db cache in file ../poller.C, line 1891). (cu/cuAssert)" in the system.log file. I called Saeed Honaryar since it might be a DB problem. He had never heard of this; he had me check a couple of tables and everything looked good. He said I should contact Dave Shepard since it might be a poller problem. I've sent Saeed Honaryar, Ravi Pattabhi, Dave Shepard, and Brad Carey copies of the system log files. Today, 13-Mar-2002, the machine seems to be running fine after being shut down for the night and restarted this morning. -Submitted by Mike Loewenthal, the Beta AE-

3/13/2002 11:54:32 AM Betaprogram (Attachments stored with orig email in Outlook public folder under Beta Test > active > StateFarm > issues) Hi, here is an e-mail sent to me about the problem I just submitted. Attached are the log files I sent, to be submitted with the problem ticket, as well as the orig. e-mail at the bottom that I sent to Dave. Thanks. -Mike-

-----Original Message----- From: Shepard, Dave To: Hawkes, Richard; Trei, Robin Cc: Loewenthal, Michael; Carey, Brad; Honaryar, Saeed; Pattabhi, Ravi Sent: 3/13/2002 10:42 AM Subject: FW: nhiPoller error related to Db cache error

The error being reported by the pollers: Tuesday, March 12, 2002 04:19:36 PM Internal Error nhiPoller[Live] Pgm nhiPoller[Live]: Expectation for 'Bad' failed (Element not found in Db cache in file ../poller.C, line 1891). (cu/cuAssert) This means that the poller read the poller.cfg file, then updated its db cache with a call to cdbGetPollElements().
When it tries to do a lookupElementByName(), it is not finding the elements in the cache. This points to a db problem where the returned caches are not consistent with the latest set of updates. The Db group has been working on several Remedy issues with the same symptoms, normally showing up in the CfgServer with similar attempts to resolve dbIds based on names. A ticket should be created for this and assigned to the Db Group. Cheers, --Dave

>-----Original Message----- >From: Loewenthal, Michael >Sent: Wednesday, March 13, 2002 10:25 AM >To: Shepard, Dave; Carey, Brad >Cc: Honaryar, Saeed; Pattabhi, Ravi >Subject: nhiPoller error related to Db cache error

Dave/Brad, I spoke with Saeed Honaryar this morning about an error. We checked the nh_element_core and nh_element_aux tables and the poller.cfg, with the three reporting:

SQL> select count(*) from nh_element_core;
  COUNT(*)
----------
     15086

SQL> select count(*) from nh_element_aux;
  COUNT(*)
----------
     15086

# grep segment poller.cfg | wc
15087 45260 572153

Attached also are the system logs, though this morning everything looks fine. I shut down eHealth last night; when I got in at 08:30 this morning, the servers were running and had started at 07:30 (when Craig Cies of State Farm got in), so I presume he started them. This error is on 5.5 Beta 4 running on HP-UX 11.0 on an N-Class machine. You can reach me at the desk I am using today, 309-763-4614, or on my mobile phone, 203-820-6940. Thanks. -Mike Loewenthal- Beta AE

3/13/2002 11:56:31 AM dshepard [repeats the analysis from Dave Shepard's forwarded e-mail above: the poller read poller.cfg and updated its db cache via cdbGetPollElements(), then lookupElementByName() failed to find the elements in the cache, pointing to a db problem where the returned caches are not consistent with the latest updates; a ticket should be created and assigned to the Db Group]

3/13/2002 7:03:41 PM rpattabhi I talked to Dave S and also to Will about this. Apparently this bug was fixed in B5 and is caused by a bug with the MV not getting updated. Also, State Farm no longer has this problem, so I am closing this as Fixed in B5. -Ravi

3/14/2002 11:24:12 AM foconnor Customer started to experience rollup failures because of a disk-full problem. The customer has freed up about 350 MB of disk space and now the rollups are failing with a core dump. The rollups are run on a day-to-day basis using a script that runs the rollup one day at a time. Core file and other files are on //BAFS/escalated tickets/61000/61362. From the customer: 1. I ran the script without changing the rollup intervals; for the first day (1-Feb) it took about 60 minutes, for the following days it took about 5 secs; no core dumps from nhiRollupDb. 2. I changed the rollup intervals down from 48 days; for the first days (1-Feb to 26-Feb) it took a few minutes each, from 27-2 it took about 20 min each, and on the last days nhiRollupDb core dumped. I checked the number and names of the tables (with help \g) and found after each run no changes in table names nor in number. Are the core dumps due to files getting corrupted, tables getting corrupted, or something else?
There appears to be plenty of disk space now.

3/15/2002 10:21:00 AM foconnor strings core | grep stack: Error occurred while dumping stacks Ex_diag_link 0x%x ii.e8pw8.dbms.*.stack_size: 131072 ii.e8pw8.dbms.*.stack_size ii.e8pw8.recovery.*.stack_size overflow of stack in push() During stack unwinding, a destructor must handle its own exception yacc stack overflow backtrack stack overflow: expression generates too many alternatives

3/18/2002 12:03:38 PM shonaryar I looked at the core file and it is stopping in the modifyToHeap routine, which, when I talked it over with Robin, signals a corrupt database. I asked Farrell to send me errlog.log and suggested that the customer destroy, recreate, and reload the DB. saeed

3/18/2002 12:30:52 PM foconnor Called Stephan (ICS) and recommended the save, destroy, create, load, but Stephan says the rollups failed and then the disk filled up, not that the disk filled up and the rollups failed. So Stephan is concerned that the problem will just reoccur after the load.

3/18/2002 12:32:51 PM foconnor Waiting for errlog.log, although the customer says there is nothing interesting in it.

3/18/2002 12:58:25 PM shonaryar I talked to Farrell and asked him for a dump of nhiRollupDb -Dall and also errlog.log. saeed

3/19/2002 9:07:23 AM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Tuesday, March 19, 2002 8:56 AM To: Honaryar, Saeed Cc: Trei, Robin; Wickham, Mark; O'Connor, Farrell Subject: problem ticket 21697 Importance: High

Good morning Saeed, the customer is willing to load a database onto a new 5.0.2 server. I just need to know if the database is sound enough to give them the go-ahead. The debug file for the failed rollup and the errlog.log files are in //BAFS/escalated tickets/61000/61362/March19. The rollup debug file is very large; Notepad and WordPad are worthless! This customer is about ready to throw us out!!
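The db-cache consistency check Mike Loewenthal ran in the State Farm ticket above (row counts in nh_element_core/nh_element_aux versus a grep of poller.cfg) can be sketched as follows. This is an illustration only: the "segment" line format and the one-line slack for a header entry are assumptions, not documented poller.cfg behavior.

```python
def count_poller_elements(cfg_lines):
    """Count the lines mentioning 'segment', as 'grep segment poller.cfg | wc -l' would."""
    return sum(1 for line in cfg_lines if "segment" in line)

def counts_consistent(db_row_count, cfg_lines, slack=1):
    """Compare the DB element-row count with the poller.cfg entry count,
    allowing a small slack (the ticket's grep count was one higher than
    the table counts, consistent with one non-element line matching)."""
    return abs(count_poller_elements(cfg_lines) - db_row_count) <= slack

# Hypothetical miniature poller.cfg with three element entries.
cfg = ["segment lan0", "segment lan1", "segment lan2"]
print(counts_consistent(3, cfg))    # counts agree
print(counts_consistent(10, cfg))   # mismatch, like a stale db cache
```

A mismatch here would not prove cache corruption, but it is the quick screen the engineers used before escalating to the Db group.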
3/19/2002 3:40:13 PM shonaryar We looked at the errlog.log and there are signs of database corruption; I also confirmed that with Robin. We think upgrading to 5.0 can remedy the problem. saeed

3/14/2002 11:38:00 AM Betaprogram PWC: Don Mount donald.d.mount@us.pwcglobal.com 813-348-7252. Scheduled Clean Nodes failure. Although the Cleanup_Nodes.100010.log shows complete and no errors, an nhDiag message shows a job step failed. The Conversation poller shows bad polls on all probes after the Clean Nodes runs. I was able to stop and start the Oracle database, and the Conversation poller then shows good polls. A node count of DbStatus Nodes and a SQL query of the database do not match.

more Cleanup_Nodes.100010.log ----- Job started by User at '03/14/2002 02:14:42 AM'. ----- ----- $NH_HOME/bin/sys/nhiCleanupNodes -hoursOld 0 -hide -hoursOld 24 -hide ----- Scheduled Job ended at '03/14/2002 03:42:41 AM'. -----

nhDiag email: 03/14/2002 03:42:39 DbsOcDbJobStepFailed Pgm nhiDbServer: Error: Job step 'Cleanup Nodes' failed (the error output was written to /opt/concord/neth/log/Cleanup_Nodes.100010.log Job id: 100010). 03/14/2002 04:00:11 PlrOcDgmNoPollEnd Error: 'Traffic Accountant' poll did not complete (poll started at '03/14/2002 02:00:11').

3/14/2002 1:30:35 PM rpattabhi Donna says that this customer also logged another bug saying they did not have permissions to create files. Apparently the user was logged in as root instead of nhuser. Just making a note of this in the bug. Also, someone on the TA team should probably triage this first and see if it is a db problem. Rich will be reassigning this bug. -Ravi

3/15/2002 10:57:08 AM rpattabhi Putting it back in MoreInfo. We should retry the test after the latest MV changes are complete. (This will put the FE machines back to the way they were before.)

3/15/2002 10:58:44 AM rpattabhi Please ignore the previous comment.

3/15/2002 6:31:20 PM rpattabhi MoreInfo: I need the following info from the customer. I have already sent mail to the customer.
Don: I have looked at nhCleanupNodes and, based on what the code is doing, I need the following info to proceed with this bug. Can you send me the output of the following commands? 1) sqlplus $NH_USER/$NH_USER < ...

... B5 on Monday, 3/11. It's been running fine up until this morning. Database is corrupted. Console won't start.

3/14/2002 4:04:35 PM shonaryar He was out of disk space, so he deleted some redo log files, which really messed up everything. I asked him to destroy the database and recreate it. saeed

3/15/2002 11:18:04 AM dwaterson Ascii database save fails. Customer wants to go from an NT machine to Solaris; however, they are unable to do this because the ascii save keeps failing. NOTE: Reference Problem Ticket 20733 for 4.7.1.

Begin processing (4/3/2002 16:38:36). Copying relevant files (4/3/2002 16:38:36). Unloading the data into the files, in directory: 'D:/DB-save-ascii/05032002.tdb/'. . . Unloading table nh_active_alarm_history . . . Unloading table nh_active_exc_history . . . Unloading table nh_alarm_history . . . Unloading table nh_alarm_rule . . . Unloading table nh_alarm_threshold . . . Unloading table nh_alarm_subject_history . . . Unloading table nh_bsln_info . . . Unloading table nh_bsln . . . Unloading table nh_calendar . . . Unloading table nh_calendar_range . . . Unloading table nh_assoc_type . . . Unloading table nh_elem_latency . . . Unloading table nh_element_class . . . Unloading table nh_elem_outage . . . Unloading table nh_elem_alias . . . Unloading table nh_element_ext . . . Unloading table nh_elem_analyze . . . Unloading table nh_enumeration . . . Unloading table nh_exc_profile_assoc . . . Unloading table nh_exc_subject_history . . . Unloading table ex_tuning_info . . . Unloading table exception_element . . . Unloading table exception_text . . . Unloading table nh_exc_profile . . . Unloading table ex_thumbnail . . . Unloading table nh_exc_history . . . Unloading table nh_list . . . Unloading table nh_list_item . . . Unloading table hdl . . .
Unloading table nh_elem_latency . . . Unloading table nh_col_expression . . . Unloading table nh_element_type . . . Unloading table nh_le_global_pref . . . Unloading table nh_elem_type_enum . . . Unloading table nh_elem_type_var . . . Unloading table nh_variable . . . Unloading table nh_mtf . . . Unloading table nh_address . . . Unloading table nh_node_addr_pair . . . Unloading table nh_nms_defn . . . Unloading table nh_elem_assoc . . . Unloading table nh_job_step . . . Unloading table nh_list_group . . . Unloading table nh_run_schedule . . . Unloading table nh_run_step . . . Unloading table nh_job_schedule . . . Unloading table nh_system_log . . . Unloading table nh_step . . . Unloading table nh_schema_version . . . Unloading table nh_stats_poll_info . . . Unloading table nh_import_poll_info . . . Unloading table nh_protocol . . . Unloading table nh_protocol_type . . . Unloading table nh_rpt_config . . . Unloading table nh_rlp_plan . . . Unloading table nh_rlp_boundary . . . Unloading table nh_stats_analysis . . . Unloading table nh_subject . . . Unloading table nh_schedule_outage . . . Unloading table nh_element . . . Unloading table nh_deleted_element . . . Unloading table nh_var_units . . . Unloading the sample data . . . Fatal Internal Error: Ok. (none/) See Relevant Files on BAFS: 61000/61054 3/21/2002 2:18:49 PM foconnor -----Original Message----- From: O'Connor, Farrell Sent: Thursday, March 21, 2002 2:08 PM To: Trei, Robin Cc: O'Connor, Farrell Subject: Problem ticket 21732 Robin, I spoke to you on the phone earlier today about this. You had created a new nhSaveDb.exe for a customer having the same problem on 4.7.1(problem ticket 20733). Can I get a version for 4.8? Customer is on 4.8 Patch 7 WinNT. 3/27/2002 10:25:39 AM hbui The fix is already in 4.8 patch 10. I checked out the file (DuTable.C) in patch 9 and made nhiSaveDb.exe. I asked Farrell to upgrade customer database with patch 9 and sent them nhiSaveDb.exe to try. 
4/8/2002 9:21:26 AM foconnor Call ticket 61054 was successful and it took ~80 hours 4/19/2002 10:34:30 AM mwickham -----Original Message----- From: Wickham, Mark Sent: Friday, April 19, 2002 9:44 AM To: Bui, Ha Subject: Escalated Problem Ticket 21732 Ha, Hi! I have a question about this problem ticket. The customer applied the one-off you built for the failing ASCII save after P09. The save completed successfully, however, the nhLoadDb fails in eHealth 5.0.2 Solaris with the following standard output: omega% nhLoadDb -p DB-save-ascii/DbSave-ascii-030402 -ascii -u health ehealth See log file /nms/ehealth/log/load.log for details... Begin processing 10/04/2002 15:17:44. Cleaning out old files (10/04/2002 15:17:44). Copying relevant files (10/04/2002 15:17:44). Error: Append to table temp_nh_schema_version failed, see the Ingres error log file for more information (E_CO0005 COPY: can't open file '/nms/DB-save-ascii/DbSave-ascii-030402.tdb/nvr_b23'. ). Error: Append to table temp_nh_schema_version failed, see the Ingres error log file for more information (E_CO0005 COPY: can't open file '/nms/DB-save-ascii/DbSave-ascii-030402.tdb/nvr_b23'. ). Error: The program nhiLoadDb failed. The load log referenced above says nothing more than, "Load of database 'ehealth' for user 'health' was unsuccessful." What do you suggest... 1) Should I put 21732 back into Assigned with the above comments, or 2) Log a new problem ticket addressing the failed load? Thank you, Mark -----Original Message----- From: Bui, Ha Sent: Friday, April 19, 2002 10:01 AM To: Wickham, Mark Subject: RE: Escalated Problem Ticket 21732 Hi Mark, Could you check with the customer again for the save log file. I think some tables haven't been saved. It looks like the save process has not gone through because it's missing the nvr_b23 file. Check to see if nvr_b23 is in the saved directory. I need to see their log file for saving. If you can, ask them for the binary database, we can try to save it here as ascii. 
You can re-open the ticket. _Ha

4/22/2002 1:06:16 PM foconnor One-off worked for call ticket 61859.

4/24/2002 10:41:17 AM dbrooks Move to field test per escalated ticket meeting 4/24.

4/25/2002 8:25:22 AM mwickham The customer who caused this to be escalated, IT-Austria, is not fixed even after applying the one-off. The ASCII save completes; however, the nhLoadDb fails on Solaris running 5.0.2. We have received their database on 3 CDs, as well as the requested files from Ha on April 19; these are located on BAFS in \escalated tickets\61000\61054\22Apr02.

4/25/2002 9:04:31 AM hbui Mark, could you please find a system and load the database (I don't have a 502 system with me right now)? It won't take long if it fails at loading the nvr_* file. Then send me the log. Thanks.

4/25/2002 11:50:53 AM hbui I looked at the customer's saved database. All the saved files have the wrong filename format. They look like this: "filenameascii.zip". It should have a dot between the filename and ascii. Since the customer has patch 7, I asked Mark to upgrade the customer to patch 9 and apply the one-off for saving the database (since the fix is in patch 10). The one-off was probably applied to the wrong patch, which caused this problem.

4/30/2002 4:13:55 PM mmcnally 61859 is all set...

5/1/2002 11:00:22 AM mmcnally Ha is installing the binary save and will try to perform an ASCII save from that.

5/6/2002 8:46:01 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Monday, May 06, 2002 8:46 AM To: Bui, Ha Subject: PT 21732 Ascii database save fails Hi Ha, did the Ascii database save go OK? Thanks, Mike

5/7/2002 10:00:49 AM rhawkes From Ha: The ascii database save looks fine. I tried the customer's save on my system and I got the files in the right format (filename.ascii.zip). However, loading to the 5.0 database has a problem with the nh_element table. I have to check the file to see what is wrong with it.
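The malformed save-file names Ha describes above ("filenameascii.zip" instead of "filename.ascii.zip") are easy to screen for before attempting a load. This is a hypothetical sketch, not part of nhSaveDb or nhLoadDb; the exact naming pattern is inferred from the ticket.

```python
import re

# A correctly saved ASCII file should look like "<table>.ascii.zip";
# the mis-applied one-off produced names like "<table>ascii.zip"
# (missing the dot), which the loader then could not find.
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+\.ascii\.zip$")

def malformed_save_files(filenames):
    """Return the ascii-save files whose names are missing the dot before 'ascii.zip'."""
    return [f for f in filenames
            if f.endswith("ascii.zip") and not VALID_NAME.match(f)]

print(malformed_save_files(["nh_elementascii.zip", "nh_subject.ascii.zip"]))
# -> ['nh_elementascii.zip']
```

Running a check like this against the save directory would have flagged the bad one-off immediately, before the multi-hour load attempt failed.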
5/8/2002 4:05:21 PM hbui I installed 5.0.2 and loaded the database on a different system (without loading stat tables). Loading looked fine. I had been playing with the other system; that might have caused the failure of loading the customer's database. For now, I think we can tell the customer to go on with saving and loading.

5/9/2002 10:49:42 AM mmcnally Having the customer patch up, install the one-off, and perform the save and load.

5/9/2002 10:57:01 AM dbrooks Should be in more info.

5/10/2002 12:29:19 PM mmcnally Ha, the customer went through our steps of patching up and applying the one-off, and the load failed with an error again. omega% nhLoadDb -p DB-save-ascii/DbSave-ascii-030402 -ascii -u health ehealth See log file /nms/ehealth/log/load.log for details... Begin processing 10/04/2002 15:14:17. Cleaning out old files (10/04/2002 15:14:17). Copying relevant files (10/04/2002 15:14:17). Error: '/nms/DB-save-ascii/DbSave-ascii-030402.tdb/nvr_b23' is not a file name. Error: '/nms/DB-save-ascii/DbSave-ascii-030402.tdb/nvr_b23' is not a file name. Error: The program nhiLoadDb failed. Was your test load successful? Thanks, Mike

5/10/2002 2:59:23 PM hbui There's still something wrong with the saving. The nvr_b23 file for the schema_version table is still missing. Mike, could you get the save log file?

5/13/2002 11:06:28 AM dbrooks Ha should have put this in more info on 5/10.

5/16/2002 9:40:33 AM mmcnally Customer is performing the new ascii save with the one-off installed in the proper order this time. Awaiting a status.

5/20/2002 10:53:37 AM dbrooks Field test per escalated ticket meeting 5/20.

5/22/2002 10:15:23 AM mmcnally -----Original Message----- From: McNally, Mike Sent: Wednesday, May 22, 2002 10:15 AM To: Bui, Ha Subject: 21732 Ascii database save fails Hi Ha, the customer has a few questions on the save and load you did. 1) How long did it take? 2) During the time of the save and load, how much of a gap will they have?
The way I understand it is that during the save we take a snapshot of the rlp_boundary table and we only save the tables that exist there. Any other data we collect gets dropped if it is not seen in the rlp_boundary table. Is this correct? Thanks, Mike

6/10/2002 5:11:53 PM hbui The customer can roll up the data they have to get rid of some nh_stats_ tables. So it's the customer's call.

7/9/2002 2:57:24 PM mwickham Associated call ticket 61859 (Computer Science Corporation) has confirmed the one-off worked.

7/17/2002 11:49:52 AM hbui The fix is in patch 3. Marked as fixed.

3/15/2002 2:28:55 PM Betaprogram UNISYS: David Lerchenfeld 734-737-7202. Clean install with Beta 5, new migration. When creating the Oracle DB, got the error: Not enough usable space was specified. He has two 4 GB disks; the DB he's trying to migrate is 1.5 GB, and he selected 'small' because he has fewer than 3,000 elements.

3/15/2002 2:30:20 PM Betaprogram david.lerchenfeld@unisys.com

3/15/2002 3:51:12 PM shonaryar This is because we require about 15 GB of space for the small configuration. saeed

3/15/2002 4:14:32 PM Betaprogram Saeed, Anil, and Joel, interesting information from Dave at Unisys! He's the customer I ran around for this afternoon. He's doing a clean install with Beta 5 + new migration. When creating the Oracle DB, he got the error: Not enough usable space was specified. He has two 4 GB disks; the DB he's trying to migrate is 1.5 GB, and he selected 'small' because he has fewer than 3,000 elements. Well, it bothered him enough to play with it - and he got around it! He sent the workaround below. I just thought you might be interested. Melissa, please update ticket #21744 with Dave's memo below. Thanks! Donna

-----Original Message----- From: Lerchenfeld, David W.
[mailto:david.lerchenfeld@unisys.com] Sent: Friday, March 15, 2002 3:39 PM To: 'Amaral, Donna' Cc: 'Concord_Beta'; 'Betaprogram'; 'BetaGroup'; Penglase, William Subject: Size Error Additional Info

Donna, after talking to you on the phone and looking around at the code, I came up with the following short-term fix for our initial error during the creation of the Oracle database: Error: Not enough useable space was specified.

Looking into the nhCreateDb script, I found what I consider to be a possible oversight in the allocation of space. I will attempt to explain: we are currently using two 4 GB drives, which were the same ones we used for Beta 4. The following code from nhCreateDb allocates the sizes of the different structures of the Oracle database. After poking around a bit, I discovered that the largest size allocation is i_nh_index_size, which is attempting to allocate 3580 MB of disk space and which, as can be seen, is in the last position of the tableSpaceSize array.

set -A tableSpaceSize i_redolog1_size i_redolog2_size i_archlogs_size i_nh_data01_size i_nh_data02_size i_system_dbf_size i_nh_rollback_size i_nh_temp_size i_nh_index_size

The code in nhCreateDb attempts to alternate between the disk directories given by the user - in our case 2, but again it could be more. The main problem I can see with this code is that the LARGEST allocation of disk space is being performed LAST. I would propose that the LARGEST disk allocation be performed first. After that, the directories can still be rotated to allocate the remaining space. I have modified the code as shown below, and the script is NOW continuing to create the database.

set -A tableSpaceSize i_nh_index_size i_redolog1_size i_redolog2_size i_archlogs_size i_nh_data01_size i_nh_data02_size i_system_dbf_size i_nh_rollback_size i_nh_temp_size

The biggest problem I had with this code is that the total space check PASSED for the Oracle create, but the ineffective allocation of the space caused it to fail.
Please pass this on to your engineering staff for their review. I will continue with the migration using this modification, as we are still eager to test this software. Dave

3/18/2002 11:38:53 AM dvenuto We will be reviewing this algorithm to determine if there are issues with the overall algorithm (for the majority of customers).

3/18/2002 5:35:18 PM shonaryar Changed the tablespace ordering for tableSpaceSize in wsCore/oracle/scripts/nhCreateDb.sh. saeed

3/19/2002 8:33:38 AM jdorden Test Note: You need to run nhCreateDb on all platforms. On one platform, make sure you have less than 3.5 GB of space available on every disk. NH_INDEX needs at least 3.6 GB of space, so with this test it should discontinue the installation.

3/19/2002 12:09:59 PM Betaprogram ALCATEL CAN: Empowered Networks for Alcatel Canada. Jayde Hanley jayde.hanley@empowerednetworks.com 613-290-5404. After successfully completing migration from 5.0.2 to 5.5 B5, I stopped and restarted the server. The Console gave these messages:

Tuesday, 03/19/2002 10:34:31 AM System Event The server is not running, starting server . . .
Tuesday, 03/19/2002 10:36:46 AM Error (Console) Unable to connect to `port 5056` (unable to connect to the server).
Tuesday, 03/19/2002 10:36:46 AM System Event Console initialization failed.

After getting this error, I stopped and restarted the eHealth server again. This time the server started properly and polling began.

3/19/2002 4:01:34 PM rnaik Can you try this a couple of times and see if we can reproduce the problem? Stop eHealth. From the command line:

cd $NH_HOME/bin/sys
nhiServer start -Dall -Dt >& nhiServer.txt

Check whether the servers come up successfully or not. If they fail to, send in the nhiServer.txt to us.

3/19/2002 4:26:17 PM Betaprogram Requested info from customer.

3/20/2002 8:49:27 AM rlindberg This is really a server issue, so I'll re-assign to Rupa, who did the first-pass analysis.
3/20/2002 10:16:54 AM Betaprogram Hi, I tried to follow the instructions given to me regarding this ticket:

Stop eHealth. From the command line:
cd $NH_HOME/bin/sys
nhiServer start -Dall -Dt >& nhiServer.txt
Check whether the servers come up successfully or not. If they fail to, send in the nhiServer.txt to us.

Here is what I get:

$ ./nhiServer start -Dall -Dt >& nhiServer.txt
nhiServer.txt: bad number

It looks like the syntax given to me is bad. Could someone check it and reply? Jayde

3/20/2002 12:36:05 PM boutotte Jayde and I just talked. He is going to stop and start the servers the normal way a few times. The expectation is that this won't be reproducible.

3/20/2002 2:11:09 PM mfintonis This is to confirm that the problem with the server stopping and restarting when changing the polling rate of the Normal polling interval is consistent. I tried it again by changing the Normal polling rate from 10 minutes to 15 minutes. The same messages as those listed below were again seen. Jayde

Jayde Hanley said:
> More on this ticket:
>
> I changed the Normal polling interval from 5 minutes to 10 minutes. Here is what happened.
>
> Wednesday, 03/20/2002 10:16:56 AM
> The 'Statistics Poller' has been updated with configuration changes.
>
> Wednesday, 03/20/2002 10:18:56 AM System Event
> The poller configuration has been modified, re-initializing . . .
>
> Wednesday, 03/20/2002 10:19:10 AM System Event
> The poller configuration has been modified, re-initializing . . .
>
> Wednesday, 03/20/2002 10:20:02 AM Warning (Message Server)
> Attempt to release unlocked resource Poller Config (All).
>
> Wednesday, 03/20/2002 10:22:30 AM Fatal Internal Error (Statistics Poller)
> Unable to get a handle for database 'ehealth'. (plr/Poller::setupPolling)
>
> Wednesday, 03/20/2002 10:22:35 AM
> The server stopped unexpectedly or database load completed, restarting . . .
>
> Wednesday, 03/20/2002 10:22:50 AM System Event
> Initializing the console with the server on 'ebeta' . . .
> Wednesday, 03/20/2002 10:23:44 AM System Event
> Console initialization complete.
>
> Wednesday, 03/20/2002 10:23:35 AM
> Controller has started. Product version is 5.5.0.0.1177.
>
> Wednesday, 03/20/2002 10:27:28 AM
> Poller initialization complete (Conversations Poller).
>
> Wednesday, 03/20/2002 10:28:30 AM
> Poller initialization complete (Import Poller).
>
> Wednesday, 03/20/2002 10:32:25 AM
> Poller initialization complete (Statistics Poller).
>
> Wednesday, 03/20/2002 10:32:25 AM
> Poller initialization complete (Fast Live Poller).
>
> Polling then resumes normally.
>
> Jayde

3/20/2002 2:47:46 PM boutotte You had reported a bug in Beta 4 that ended up being bad config data. You worked with Saeed on this. It turned out that there were references to elements with IP address 255.255.255.255. You fixed this up in the Oracle database and everything was running on Beta 4. The database you have now was re-migrated from 5.0 again using Beta 5. We're thinking it's possible that you have the same problem again, since the 5.0 Ingres version contains the original problem. Can you check this, and if it's true, can you make the same fixes? If it's not true, we need to get the current database.

3/20/2002 3:02:26 PM Betaprogram I have tried stopping and restarting the server a few times and I've been unable to reproduce the error messages seen for this ticket. Jayde

3/21/2002 9:10:31 AM wzingher Closing as NoDupl following the above update from Jayde. Looks to have been a timing issue at startup, which happens occasionally.

3/21/2002 10:22:35 AM boutotte We are still looking at the following issue. Re-opening.

Wednesday, 03/20/2002 10:22:30 AM Fatal Internal Error (Statistics Poller) Unable to get a handle for database 'ehealth'.
(plr/Poller::setupPolling)

3/25/2002 10:56:23 AM Betaprogram -----Original Message----- From: Jayde Hanley [mailto:jayde@empowerednetworks.com] Sent: Monday, March 25, 2002 10:16 AM To: damaral@concord.com Cc: concord_beta@concord.com

I am placing a copy of Alcatel's pre-migration database in ftp.concord.com/incoming/alcatel_canada. The saved db has been split into 10 MB files, so you will find the files named alcatel-db-502-feb15.tar.gz.aa to alcatel-db-502-feb15.tar.gz.aq. Join all these files together before unpacking. It will be about 30 minutes from now before all the files are there. Jayde

3/25/2002 11:39:14 AM rnaik
Wednesday, 03/20/2002 10:22:30 AM Fatal Internal Error (Statistics Poller) Unable to get a handle for database 'ehealth'. (plr/Poller::setupPolling)
I just spoke to Jayde; the error above (John B put this error in on March 21st) happens every time they change the polling rate.

3/27/2002 10:29:22 AM boutotte The error is coming from the PollTimer constructor. It does a _db = DuDatabase::singleInstance(). It seems the poller lost the handle to the database. Transferring to Dave S. per discussion in the 5.5 managers meeting. Targeting to 5.5 Patch 1.

3/27/2002 12:47:16 PM dshepard I don't see why this would be poller related. The problem is in the database layer. Reassigning to the Db group.

3/27/2002 3:41:45 PM rhawkes Assigning to Ravi to research.

3/27/2002 4:10:28 PM dvenuto Marking postponed for P1.

5/6/2002 6:20:38 PM rtrei We need to test what happens when the polling rate is changed. If it truly loses the db handle, then this must go out as a patch. Otherwise, it can be closed.

9/13/2002 2:42:34 PM hbui Set up a 5.5 system; no matter how many times the poller rate was changed, the error didn't occur. Closing as NoDupl.
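Earlier in this thread, the debug command `nhiServer start -Dall -Dt >& nhiServer.txt` produced `nhiServer.txt: bad number`. A plausible explanation (an inference on my part, not stated in the ticket) is that `>&` is csh-family syntax: Bourne-family shells such as sh and ksh parse `>& word` as duplicating output onto a numeric file descriptor, so a filename where a number is expected yields "bad number". The portable sh/ksh form redirects stdout to the file and then duplicates stderr onto it:

```shell
# Bourne/ksh-compatible capture of both stdout and stderr into one file.
# ">& file" is csh syntax; under sh/ksh use "> file 2>&1" instead, e.g.
#   ./nhiServer start -Dall -Dt > nhiServer.txt 2>&1
{ echo "normal output"; echo "error output" 1>&2; } > nhiServer.txt 2>&1
cat nhiServer.txt    # both lines are in the file
```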
3/19/2002 1:22:47 PM Betaprogram EARTHLINK: (Submitted by Mike Loewenthal, the Beta AE.) Cody West cwest1@corp.earthlink.net 626-296-5783. After upgrading from Beta 4 to Beta 5 at Earthlink, when trying to start eHealth, the servers would not start because eHealth was stating that the DB version was incorrect. After running nhConvertDb, the DB version was reported correctly, so it looks as though nhConvertDb was not run during the upgrade.

3/19/2002 1:43:34 PM Betaprogram This is actually an existing ticket, 21304, that was NoDupl up until now. Reopened that ticket. Closing this ticket as a repeat of 21304.

3/20/2002 2:52:10 PM dwaterson More than one "all" group in group lists. Customer has a group called "All" and also two other groups called "all". When trying to modify the groups listed as "all", the console crashes. They are able to delete the groups; however, the groups reappear later. On BAFS: 61000/61117 there are screenshots displaying this behavior. I am not able to reproduce this on my 5.0.2 box.

NOTES:
If the "all" groups are deleted, they are recreated automatically.
This seems to be occurring with Lan, Wan, and Lan/Wan groups only.
When trying to modify the "all" groups, the console closes.
This was a fresh install of 5.0.2 on Windows 2000 Server SP1. The database was saved from the 4.8 version that was previously installed on the server and then loaded into 5.0.2. A copy of the 5.0.2 database is on BAFS.

3/22/2002 5:52:52 PM hbui I fixed the database convert/load so that all the "all" and "summary" subjects are renamed to something else. That will prevent the select for group/group list based on name and type from failing. The fix is in patch 3.

3/20/2002 6:07:22 PM rrick Server crashed on 3/16/02 at 4:31 AM, leaving a core file in $NH_HOME. Running strings on the core file showed that Live Exceptions processing was occurring at the time of the crash.
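The note above triaged the crash by running strings on the core file. A minimal illustration of the technique; the file and embedded marker here are fabricated stand-ins, not the actual core:

```shell
# strings(1) extracts printable runs of 4+ characters from a binary, so
# module or subsystem names embedded in a core file show up in its output.
# Build a tiny fake "core" containing NUL-separated markers:
printf 'liveExceptions\0x\0statisticsPoller\0' > fake.core
strings fake.core | grep -i live    # surfaces the embedded marker
```

Against a real core, grepping the output for known subsystem names (as done in the ticket) is a quick first pass before breaking out a debugger.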
Checked the $NH_HOME/idb/ingres/errlog.log file and saw that the error occurred in DMF processing. errlog.log extract:

SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_SC0122_DB_CLOSE Error closing database.
Name: nethealth Owner: ehealth
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: /opt/eh/ingdb/ingres/data/default/nethealth Flags: 00000003
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 000001fe]: Sat Mar 16 04:00:56 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_QE009C_UNKNOWN_ERROR Unexpected error received from another facility. Check the server error log.
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_DM004A_INTERNAL_ERROR Internal DMF error detected.
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_SC0122_DB_CLOSE Error closing database. Name: nethealth Owner: ehealth
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_SC010D_DB_LOCATION Database Location Name: $default Physical Specification: /opt/eh/ingdb/ingres/data/default/nethealth Flags: 00000003
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_SC0221_SERVER_ERROR_MAX Error count for server has been exceeded.
SERVER10::[23365 , 0000001d]: Sat Mar 16 04:00:58 2002 E_PS0501_SESSION_OPEN There were open sessions when trying to shut down the parser facility.
SERVER10::[23365 , 0000001c]: An error occurred in the following session:

All files are on bafs/61000/61735, including the core file.

3/21/2002 12:26:55 PM rmitchel This sounds like an Ingres problem.

3/28/2002 5:44:40 PM yzhang Let's take care of the stack dump problem first: have the customer upgrade to 48P9, and at the same time double the dbms.stack_size parameter through cbf.

4/2/2002 5:31:35 PM yzhang Any progress on this one?

4/8/2002 9:14:41 AM jnormandin All set. Call closed.

3/21/2002 11:14:39 AM wburke Fetch takes too long due to resource contention and bad table writes. Fetches kick off; about 3-4 hours into the fetch, the following appears in the errlog.log:

VARCHVP0::[43916 , 00001230]: Thu Mar 21 04:30:48 2002 E_DM0042_DEADLOCK Resource deadlock.
VARCHVP0::[43916 , 00001230]: Thu Mar 21 04:30:48 2002 E_QE002A_DEADLOCK Deadlock detected.
VARCHVP0::[43916 , 00001230]: Thu Mar 21 04:31:24 2002 E_DM901C_BAD_LOCK_REQUEST Error requesting a lock on mode: 00000005 for the lock list: 0000005B.
VARCHVP0::[43916 , 00001230]: Thu Mar 21 04:31:24 2002 E_DM901F_BAD_TABLE_CREATE Error creating the table: nh_stats0_1015768799_ix1 in database: ehealth.

3/21/2002 11:16:03 AM wburke -----Original Message----- From: Burke, Walter Sent: Thursday, March 21, 2002 11:05 AM To: ts_esc_leads Cc: Fanning, Susan; Jarvis, Rob Subject: Please escalate # 21849 - Bank of America

Revenue Impact 300k. Fetch causes resource contention issues:

E_DM901C_BAD_LOCK_REQUEST Error requesting a lock on mode: 00000005 for the lock list: 0000005B.
VARCHVP0::[43916 , 00001230]: Thu Mar 21 04:31:24 2002 E_DM901F_BAD_TABLE_CREATE Error creating the table: nh_stats0_1015768799_ix1 in database: ehealth.

3/21/2002 12:11:59 PM wburke Yulun to forward an nhFetchDb which will not index stats; run nhiIndex manually.
3/27/2002 3:34:07 PM wburke -----Original Message----- From: Trei, Robin Sent: Wednesday, March 27, 2002 3:23 PM To: Burke, Walter; Zhang, Yulun; Wolf, Jay Cc: Piergallini, Anthony; Keville, Bob Subject: RE: Bank Of America Status

Walter: As I mentioned, I reviewed all the DP code and did not see any place where we try to recreate an index outside of the one area we disabled, so I don't have a good answer for why this is happening. It is obvious that nhiIndexStats is one of the players, and it is most likely that the nhFetchDb code is the other player, although we have no proof as such. My best guess is that it is trying to recreate the index while we have the table locked (in nhFetchDb) for the bulk load, or something of that sort. If this is the case, then putting the cdbGetDdlLock in nhFetchDb (which will synchronize the DDL changes) will resolve this problem. Yulun is testing those changes now.

3/28/2002 12:13:32 PM rhawkes The work we have identified for this is for Yulun to merge Ha's and Robin's fixes and deliver the result to Bank of America and NTL.

3/28/2002 12:20:20 PM yzhang Walter, as we talked, do the following on the central site:
1) Delete from the following tables: nh_deleted_element_core, nh_deleted_element_aux, nh_elem_alias, nh_elem_assoc, nh_elem_latency, nh_element_aux, nh_element_core, nh_group, nh_group_list, nh_group_list_members, nh_group_members
2) Overwrite poller.cfg with poller.init
3) Stop, then start, nhServer
4) Make sure there is at least one good remoteSave with -g and -gl on each of the remote sites
5) Run fetch

4/1/2002 2:23:00 PM yzhang Don, I noticed that the call ticket for 21849 has been closed. I think I am going to close the associated problem ticket. Yulun

3/22/2002 11:13:48 AM Betaprogram DCMA: Joe Banks jbanks@hq.dcma.mil phone 703-428-1548 fax 703-428-1395. I am unable to load a saved database from the eHealth console or from the eHealth server's command line. This has been a recurring problem for me.
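Step 1 of Yulun's central-site procedure can be scripted rather than typed table by table. The sketch below only generates the DELETE statements into a reviewable SQL file; the table names come from the ticket, but the idea of spooling them to a file first (so they can be inspected before being fed to the database) is my own suggestion, and any sqlplus invocation or paths would be site-specific.

```shell
#!/bin/sh
# Table list from step 1 of the ticket's cleanup procedure.
tables="nh_deleted_element_core nh_deleted_element_aux nh_elem_alias \
nh_elem_assoc nh_elem_latency nh_element_aux nh_element_core \
nh_group nh_group_list nh_group_list_members nh_group_members"

# Write one DELETE per table into cleanup.sql for review, then a COMMIT.
: > cleanup.sql
for t in $tables; do
    echo "DELETE FROM $t;" >> cleanup.sql
done
echo "COMMIT;" >> cleanup.sql

head -1 cleanup.sql
```

After review, the generated file could be run against the database, followed by the remaining steps: copy poller.init over poller.cfg, restart nhServer, verify a good remoteSave with -g and -gl on each remote site, then run the fetch.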
The following has been copied from the load.log file:

Begin processing 3/22/2002 10:00:18 AM.
Cleaning out old files (3/22/2002 10:00:18 AM).
Copying relevant files (3/22/2002 10:00:18 AM).
Error: Database error: ERROR: SQLCODE=-1012 SQLTEXT=ORA-01012: not logged on.
Error: The program nhiLoadDb failed. Refer to log c:/ehealthdb/ehealth.tdb/oracle_rman for more details.
Error: nhiLoadDb failed.

3/22/2002 11:16:13 AM Betaprogram Hi Robin, I've opened a bug ticket on the web site and I'm following it up with this email and some attached files so that I can expedite things. I had to once again rebuild my eHealth server yesterday, because when I upgraded to Beta 5 my database size tripled and I ran completely out of hard drive space. I saved the database from my old system (which is running Beta 4) and tried to load it this morning, and I received errors. The following error message was copied from the load.log file:

Begin processing 3/22/2002 10:00:18 AM.
Cleaning out old files (3/22/2002 10:00:18 AM).
Copying relevant files (3/22/2002 10:00:18 AM).
Error: Database error: ERROR: SQLCODE=-1012 SQLTEXT=ORA-01012: not logged on.
Error: The program nhiLoadDb failed. Refer to log c:/ehealthdb/ehealth.tdb/oracle_rman for more details.
Error: nhiLoadDb failed.

I also tried to load it from the eHealth Console and from the eHealth server command line, and the errors that I received are shown in the files attached to this email. I will await your response. <> <> <> <> <> Office# (703) 428-1548 Cell# (301) 529-9359 Joe

**Attachments stored in original email from site in Outlook public folder: Engineering > Beta Test > 5.5 > Beta Sites (active) > DCMA > Issues/Bugs

3/22/2002 11:18:01 AM Betaprogram Rich: Given my current escalated ticket load, I think this should be reassigned to Ravi. We need to make sure that this isn't the same problem as previously; if it is, then we can assume that the 1-in-10,000-times estimate is grossly incorrect.
I think it is a red-flag item.

3/25/2002 3:31:38 PM rpattabhi I am waiting for customer feedback on this. The customer is trying to load a B4 binary save into B5, and this will not work. This is a known issue. It happens because there are more redo logs in B5 than in B4. I am planning to give the customer a workaround once he confirms that this is in fact the problem. The info I need is the result of this command; I need the file datfile.log that it creates. -Ravi

-----Original Message----- From: Pattabhi, Ravi Sent: Monday, March 25, 2002 10:20 AM To: Pattabhi, Ravi Subject: RE: Beta 5 DB Bug (21861) DCMA

sh sqlplus neth/neth > datfile.log 2>&1 <

nhiListTypes.exe | grep resp
104092 respPath "Response Path" "Response Path"

Then we tried to dump the labels for this element type, but got nothing back:

E:\eHealth\bin\sys>nhiLabelTblDump.exe -e 104092
E:\eHealth\bin\sys>

On the poller itself, there are no polling errors on any of the response paths, and a Trend report shows 100% good polls. When running LT in normal polling mode (not fast polling) against a response path element that we know we have data for in normal polling mode (i.e., we can run a Trend report against a path and get data for, say, transactions, bytes, response time, etc.), when you try to start the LT chart with 2 hours of data, you see a pop-up saying that LT could not get the historical data. Then, as LT tries to poll the element, it continuously says: Poll not ready. Will retry at local time