Data

NFHS-4

Domain Name: Health
Files Shared: 5
Sheets Shared: 10
Files Ingested: 5
Sheets Ingested: 10
Ingestion %: 100%
Landing Tables: 26
Staging Tables: 10
Average Rating: ** (Difficult)
Processing Error Rate: 10%
Record Error Rate: 20%
File Format: .DCT, .DAT
LGD Code Included: YES
Raw data S3 Path: NFHS4/
Pipeline Path: NFHS4/

From the raw data provided, we ingested 26 tables into landing and consolidated 10 tables into staging. Challenges faced: some attributes in the CSV files are not present in the map files; the number of columns is high; the data is at household granularity; and some columns with “_” in their names need transformation and will be combined into one when loaded into staging, which will reduce the error rate.

NFHS-5

Domain Name: Health
Files Shared: 2
Sheets Shared: 4
Files Ingested: 2
Sheets Ingested: 4
Ingestion %: 100%
Landing Tables: 10
Staging Tables: 3
Average Rating: * (Very Difficult)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: .MAP, .csv
LGD Code Included: YES
Raw data S3 Path: NFHS5/
Pipeline Path: NFHS5/

From the raw data provided, we ingested 10 tables into landing and consolidated 3 tables into staging. Challenges faced: the file sizes are large; the number of columns is high; the data is at household granularity; and some columns with “_” in their names need transformation and will be combined into one when loaded into staging, which will reduce the error rate.

PDS  – Ahara kfcsc

Domain Name: Health
Files Shared: 503
Sheets Shared: 11,220
Files Ingested: 503
Sheets Ingested: 11,220
Ingestion %: 100%
Landing Tables: 10396
Staging Tables: 0
Average Rating: *** (Medium)
Processing Error Rate: 0%
Record Error Rate: In Progress
File Format: Excel
LGD Code Included: NO
Raw data S3 Path: Ahara-Kfcsc/
Pipeline Path: PL_P0_AHARA.ipynb

From the raw data provided, we ingested 10396 tables into landing. Challenges faced: the Ahara files have multiple sheets with unusual text added to the headings; the filenames do not follow proper naming conventions; and the Bangalore North file needed manual intervention to clean.

PDS  – Malnutrition data

Domain Name: Health
Files Shared: 3
Sheets Shared: 3
Files Ingested: 3
Sheets Ingested: 3
Ingestion %: 100%
Landing Tables: 4
Staging Tables: 3
Average Rating: ***** (Very Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: rawdatamalnutrition/
Pipeline Path: PL_P0_MALNUTRITION.ipynb

From the raw data provided, we ingested 4 tables into landing and consolidated 3 tables into staging. Challenges faced: empty columns, different columns, and taluk spellings that differ from those in the LGD codes file.
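One way to reconcile taluk spellings with the LGD codes file is approximate string matching. The sketch below is illustrative only and is not the method used in PL_P0_MALNUTRITION.ipynb: the function name match_taluk, the 0.8 cutoff, and the sample names are assumptions, and it relies solely on Python's standard difflib module.

```python
import difflib

def match_taluk(name, lgd_names, cutoff=0.8):
    """Return the closest LGD taluk name, or None if nothing is close enough."""
    # Compare on lower-cased, trimmed names but return the original LGD spelling.
    lookup = {n.strip().lower(): n for n in lgd_names}
    hits = difflib.get_close_matches(
        name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None

# Illustrative sample values, not taken from the actual files.
lgd_names = ["Chikkaballapura", "Bagepalli", "Gauribidanur"]
print(match_taluk("Chikballapur", lgd_names))
print(match_taluk("Unknown Place", lgd_names))
```

Names that return None would still need manual review; a lower cutoff matches more spellings but raises the risk of pairing two different taluks.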

Karnataka at a Glance 2020-21

Domain Name: Health
Files Shared: 16
Sheets Shared: 16
Files Ingested: 16
Sheets Ingested: 16
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 11
Average Rating: *** (Medium)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel (.xls)
LGD Code Included: NO
Raw data S3 Path: KAG_2020_21/
Pipeline Path: KAG_2020_21/

From the raw data provided, we consolidated 11 tables into staging. Challenges faced: multi-header titles and the separation of Kannada words from English.
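Separating Kannada words from English in a mixed label can be sketched with a Unicode range check, since Kannada characters fall in the block U+0C80 to U+0CFF. This is a minimal illustration under the assumption that the two languages are separated by whitespace; the function name split_kannada_english is hypothetical and the actual pipeline may work differently.

```python
def split_kannada_english(text):
    """Split a mixed label into (english, kannada) parts by Unicode block."""
    kannada, english = [], []
    for token in text.split():
        # A token counts as Kannada if any character lies in U+0C80..U+0CFF.
        if any("\u0c80" <= ch <= "\u0cff" for ch in token):
            kannada.append(token)
        else:
            english.append(token)
    return " ".join(english), " ".join(kannada)

# Illustrative header cell: a Kannada word followed by its English label.
english, kannada = split_kannada_english("\u0c9c\u0cbf\u0cb2\u0ccd\u0cb2\u0cc6 District")
print(english)
print(kannada)
```

Tokens that mix both scripts without a space are classified as Kannada here and would need a character-level split instead.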

Karnataka at a Glance 2020-21

Domain Name: Education
Files Shared: 23
Sheets Shared: 23
Files Ingested: 23
Sheets Ingested: 23
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 15
Average Rating: *** (Medium)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel (.xls)
LGD Code Included: NO
Raw data S3 Path: KAG_2020_21/
Pipeline Path: KAG_2020_21/

From the raw data provided, we consolidated 15 tables into staging. Challenges faced: multi-header titles and the separation of Kannada words from English.

SECC

Domain Name: Education
Files Shared:
  Excel – 7
  CSV – 31
Sheets Shared:
  Excel – 7
  CSV – 31
Files Ingested:
  Excel – 7
  CSV – 31
Sheets Ingested:
  Excel – 7
  CSV – 31
Ingestion %: 100%
Landing Tables: 40
Staging Tables: 7
Average Rating:
  Excel – **** (Easy)
  CSV – **** (Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel, .csv
LGD Code Included: YES
Raw data S3 Path: SECC/
Pipeline Path: SECC/

From the raw data provided, we ingested 40 tables into landing and consolidated 7 tables into staging. Challenges faced: large file sizes; multi-level titles; records in the Haveri file that span multiple lines; and some columns that are completely null and need a default value.

Karnataka at a Glance 2020-21

Domain Name: Agriculture
Files Shared: 16
Sheets Shared: 16
Files Ingested: 16
Sheets Ingested: 16
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 19
Average Rating: *** (Medium)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel (.xls)
LGD Code Included: NO
Raw data S3 Path: KAG_2020_21/
Pipeline Path: KAG_2020_21/

From the raw data provided, we consolidated 19 tables into staging. Challenges faced: multi-header titles and the separation of Kannada words from English.

CCE data of Directorate of Economics and Statistics (DES)

Domain Name: Agriculture
Files Shared: 17
Sheets Shared: 32
Files Ingested: 17
Sheets Ingested: 32
Ingestion %: 100%
Landing Tables: 33
Staging Tables: 1
Average Rating: **** (Very Easy)
Processing Error Rate: 0%
Record Error Rate: In Progress
File Format: Excel
LGD Code Included: NO
Raw data S3 Path: CCE/
Pipeline Path: PL_P0.0_CCE.ipynb

From the raw data provided, we ingested 33 tables into landing and consolidated 1 table into staging. Challenges faced: long column names, empty columns, and multiple spaces.
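The whitespace and empty-column issues can be handled with a small normalization pass. The sketch below is an assumption about how such a step might look, not the code in PL_P0.0_CCE.ipynb; the names is_blank and clean_columns, the snake_case convention, and the sample rows are all illustrative.

```python
import re

def is_blank(value):
    """True for None or whitespace-only values; numeric 0 is not blank."""
    return value is None or str(value).strip() == ""

def clean_columns(header, rows):
    """Collapse repeated whitespace in column names and drop all-empty columns."""
    names = [
        re.sub(r"\s+", " ", "" if h is None else str(h)).strip().lower().replace(" ", "_")
        for h in header
    ]
    # Keep a column only if at least one row has a non-blank value in it.
    keep = [
        i for i in range(len(names))
        if any(i < len(r) and not is_blank(r[i]) for r in rows)
    ]
    cleaned_rows = [[r[i] if i < len(r) else None for i in keep] for r in rows]
    return [names[i] for i in keep], cleaned_rows

# Illustrative sample, not taken from the actual files.
header = ["  District   Name ", "Area  (ha)", None]
rows = [["Haveri", 120, None], ["Tumakuru", 0, ""]]
names, data = clean_columns(header, rows)
print(names)
print(data)
```

Long column names are left intact here; shortening them would need a mapping agreed with the data owners.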

Fertilizers data 2014-18

Domain Name: Agriculture
Files Shared: 2
Sheets Shared: 13
Files Ingested: 2
Sheets Ingested: 13
Ingestion %: 100%
Landing Tables: 11
Staging Tables: 6
Average Rating: *** (Medium)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Fertilisers data 2014-2018/
Pipeline Path: Fertilizer/

From the raw data provided, we ingested 11 tables into landing and consolidated 6 tables into staging. Challenges faced: multi-level headers.
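Multi-level headers are typically flattened into a single name per column before loading. The following is a minimal sketch, assuming merged cells in the upper header rows export as blanks that should inherit the label to their left; the function name flatten_headers and the sample header rows are hypothetical and do not come from the fertilizer files.

```python
def flatten_headers(header_rows, sep="_"):
    """Merge stacked header rows into one name per column."""
    width = max(len(row) for row in header_rows)
    levels = []
    for depth, row in enumerate(header_rows):
        padded = list(row) + [None] * (width - len(row))
        # Only upper levels carry labels forward; the last row is used as-is.
        carry_forward = depth < len(header_rows) - 1
        last, level = None, []
        for cell in padded:
            text = str(cell).strip() if cell is not None else ""
            if text:
                last = text
            level.append(text or (last if carry_forward else None))
        levels.append(level)
    # Join the non-empty labels of each column from top to bottom.
    return [sep.join(p for p in column if p) for column in zip(*levels)]

# Illustrative two-level header, not taken from the actual files.
header_rows = [
    ["District", "Urea", None, "DAP", None],
    [None, "Kharif", "Rabi", "Kharif", "Rabi"],
]
print(flatten_headers(header_rows))
```

A column whose top-level cell is genuinely empty, rather than merged, would wrongly inherit its neighbour's label, so the result should be checked against the source sheet.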

Irrigation district wise 1954-2018

Domain Name: Agriculture
Files Shared: 1
Sheets Shared: 1
Files Ingested: 1
Sheets Ingested: 1
Ingestion %: 100%
Landing Tables: 1
Staging Tables: 2
Average Rating: **** (Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Irrigation – Districtwise/
Pipeline Path: PL_0_Irrigation_districtwise_data_2019.ipynb

From the raw data provided, we ingested 1 table into landing and consolidated 2 tables into staging. Challenges faced: multi-level headers.

Irrigation taluk wise

Domain Name: Agriculture
Files Shared: 2
Sheets Shared: 2
Files Ingested: 2
Sheets Ingested: 2
Ingestion %: 100%
Landing Tables: 1
Staging Tables: 2
Average Rating: **** (Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Irrigation – Talukwise/
Pipeline Path: PL_0_Irrigation_Talukwise_data_2018_2019.ipynb

From the raw data provided, we ingested 1 table into landing and consolidated 2 tables into staging. Challenges faced: multi-level headers.

Geographical land use taluk wise 2017-2018

Domain Name: Agriculture
Files Shared: 2
Sheets Shared: 2
Files Ingested: 2
Sheets Ingested: 2
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 1
Average Rating: **** (Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Geographical land Use Data Taluk Wise 2017 & 2018/
Pipeline Path: PL_0_geographical_Taluk_land_data_2017_2018.ipynb

From the raw data provided, we consolidated 1 table into staging. Challenges faced: multi-level headers.

Geographical land use district wise 2007-2018

Domain Name: Agriculture
Files Shared: 1
Sheets Shared: 1
Files Ingested: 2
Sheets Ingested: 1
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 1
Average Rating: **** (Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Geographical Land Use data Distict wise 2007 – 2018/
Pipeline Path: PL_0_geographical_land_District_data.ipynb

From the raw data provided, we consolidated 1 table into staging. Challenges faced: multi-level headers.

Principal crops data

Domain Name: Agriculture
Files Shared:
  pdf – 14
  jpeg – 873
Sheets Shared:
  pdf – 2,400
  jpeg – 873
Files Ingested:
  pdf – 0
  jpeg – 50
Sheets Ingested:
  pdf – 0
  jpeg – 50
Ingestion %: In Progress
Landing Tables: pdf – 0
Staging Tables: pdf – 0
Average Rating:
  pdf – * (Very Difficult)
  jpeg – * (Very Difficult)
Processing Error Rate: jpeg – 5%
Record Error Rate: In Progress
File Format: .pdf, .jpg
LGD Code Included: NO
Raw data S3 Path:
  Pdf- Agri_Principal_Crops PDF to_Excel/
  Excel- Agri_Principal_Crops_Image_To_Excel/
Pipeline Path:
  Pdf- Principal_Crops_PDF/
  Excel- Principal_Crops_ImagetoExcel/

We extracted the data from the image and PDF files provided. Challenges faced: the PDF files and the image files both have to be converted to Excel, and this work is still in progress; some images are not recognized well by the AWS Textract service.

Operation holdings area data

Domain Name: Agriculture
Files Shared: 207
Sheets Shared: 207
Files Ingested: 207
Sheets Ingested: 207
Ingestion %: 100%
Landing Tables: 1031
Staging Tables: 2
Average Rating: * (Very Difficult)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Operation Holdings Area Data/
Pipeline Path: Agri_Operation/

From the raw data provided, we ingested 1031 tables into landing and consolidated 2 tables into staging. Challenges faced: multiple tables per sheet.
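When a sheet holds several tables, one common approach is to split the rows wherever a fully blank row occurs. The sketch below assumes that blank rows are the separator, which may not hold for every file in this dataset; the function name split_tables and the sample rows are hypothetical.

```python
def split_tables(rows):
    """Split sheet rows into tables, treating fully blank rows as separators."""
    tables, current = [], []
    for row in rows:
        is_blank_row = all(cell is None or str(cell).strip() == "" for cell in row)
        if is_blank_row:
            # A blank row closes the table collected so far, if any.
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables

# Illustrative sheet content, not taken from the actual files.
sheet = [
    ["Size class", "Area"],
    ["Marginal", 10],
    [None, None],
    ["Size class", "Number"],
    ["Small", 4],
]
for table in split_tables(sheet):
    print(table)
```

Tables placed side by side on the same rows are not separated by this approach and would need a column-wise split as well.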

Time Series Area, Production Yield data – District wise

Domain Name: Agriculture
Files Shared: 3
Sheets Shared: 93
Files Ingested: 3
Sheets Ingested: 93
Ingestion %: 100%
Landing Tables: 90
Staging Tables: 4
Average Rating: ** (Difficult)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Time Series Area, Production, Yield Data District-wise/
Pipeline Path: TIME_SERIES/

From the raw data provided, we ingested 90 tables into landing and consolidated 4 tables into staging. Challenges faced: multi-level headers.

New data received from Agriculture dept

Domain Name: Agriculture
Files Shared: 12
Sheets Shared: 46
Files Ingested: 6
Sheets Ingested: 42
Ingestion %: 82.75%
Landing Tables: 18
Staging Tables: 0
Average Rating: *** (Medium)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: NO
Raw data S3 Path: New Data Received from Agriculture Dep/
Pipeline Path: PL_P0.0_NDRA_DEPT.ipynb

From the raw data provided, we ingested 18 tables into landing. Challenges faced: multi-level headers, corrupt text, and non-aligned columns.

Domain Name: Agriculture
Files Shared: 1
Sheets Shared: 30
Files Ingested: 1
Sheets Ingested: 30
Ingestion %: 100%
Landing Tables: 155
Staging Tables: 1
Average Rating: ** (Difficult)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: Data Received from Manju Nath sir 25-01-2021/
Pipeline Path: Agri_Census/

From the raw data provided, we ingested 155 tables into landing and consolidated 1 table into staging. Challenges faced: multiple tables per worksheet.

Crop Cutting Experiment Data (DAY TAY SAY)

Domain Name: Agriculture
Files Shared: 3
Sheets Shared: 41
Files Ingested: 3
Sheets Ingested: 41
Ingestion %: 100%
Landing Tables: 7
Staging Tables: 0
Average Rating: * (Very Difficult)
Processing Error Rate: 10%
Record Error Rate: In Progress
File Format: .TXT
LGD Code Included: NO
Raw data S3 Path: Agri_Crop_Cutting_Text_to_Excel/
Pipeline Path: KRS_TEXTfiles/

From the raw data provided, we ingested 7 tables into landing. Challenges faced: multiple tables per text file.

Crop Cutting Data\Data

Domain Name: Agriculture
Files Shared: 42
Sheets Shared: 42
Files Ingested: 33
Sheets Ingested: 33
Ingestion %: 78.57%
Landing Tables: 33
Staging Tables: 0
Average Rating: **** (Easy)
Processing Error Rate: 10%
Record Error Rate: In Progress
File Format: Excel
LGD Code Included: NO
Raw data S3 Path: Data/
Pipeline Path: AGRICULTURE_DATA/

From the raw data provided, we ingested 33 tables into landing. Challenges faced: multiple tables per sheet and improperly formatted column names. Some files have unusual formats, which is why they have not yet been ingested; the error rate will decrease once they are ingested.
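For reference, the ingestion percentage reported for this dataset is consistent with files ingested divided by files shared. The snippet below is a small worked check under that assumption; the function name ingestion_pct is illustrative.

```python
def ingestion_pct(ingested, shared):
    """Percentage of shared files that were ingested, rounded to 2 decimals."""
    if shared <= 0:
        raise ValueError("shared must be a positive count")
    return round(100 * ingested / shared, 2)

# 33 of 42 files ingested, as reported for this dataset.
print(ingestion_pct(33, 42))
```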

KRS All Year 2016-2018-19

Domain Name: Agriculture
Files Shared: 18
Sheets Shared: 18
Files Ingested: 18
Sheets Ingested: 18
Ingestion %: 100%
Landing Tables: 18
Staging Tables: 0
Average Rating: * (Very Difficult)
Processing Error Rate: 0%
Record Error Rate: In Progress
File Format: .pdf
LGD Code Included: NO
Raw data S3 Path: Agri_Crop_Cutting_KRS PDF_to_Excel/
Pipeline Path: KRS_PDF/

From the raw data provided, we ingested 18 tables into landing. Challenges faced: multiple tables per PDF file.

Master data files

Domain Name: Master Files
Files Shared: 42
Sheets Shared: 42
Files Ingested: 42
Sheets Ingested: 42
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 42
Average Rating: ***** (Very Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: PL_P0_master_raw/
Pipeline Path: master/

From the raw data provided, we consolidated 42 tables into staging. Challenges faced: long column names, empty columns, and multiple white spaces.

Raw data files (Batch 1)

Domain Name: Master Files
Files Shared: 13
Sheets Shared: 15
Files Ingested: 13
Sheets Ingested: 15
Ingestion %: 100%
Landing Tables: 0
Staging Tables: 14
Average Rating: **** (Easy)
Processing Error Rate: 0%
Record Error Rate: 0%
File Format: Excel
LGD Code Included: YES
Raw data S3 Path: PL_P0_master_raw/
Pipeline Path: raw/

From the raw data provided, we consolidated 14 tables into staging. Challenges faced: long column names, empty columns, and multiple spaces.