26/07/2024

KCET Cutoff Parser - Explained

Pain Point

The default cutoff list from KEA is huge and confusing. Creating a custom table for specific filters would help in the Application process. I had this pain point while I was applying and recently when my brother was.

The idea here is to act like a filter and get consistent data. But the major setback is parsing the damn cutoff PDF, which is extremely tight, has less padding, and in weird table format.

Tech Stack

I wanted to keep tech-stack simple, as I wanted to focus more on improving PDF table parsing. The app frontend is built using React + Primereact and the backend is running onFrappe connected to MariaDB, hosted on DigitalOcean.

Solution

Now I started with tabula-py, but it missed a lot of rows. For starters, it had issues where for a single row that has let’s say 20 columns, 4 column’s data will be in 1 row, then some x columns data will be in 2 rows, and so on. No pattern, pure randomness. Tried writing a logic for it, but also explored other alternatives simultaneously.

Screenshot from 2024-06-30 17-49-48.png

Screenshot from 2024-06-30 17-50-35.png

value aggregation problem screenshots

The second approach was using OCR, the output was decent but required some cleanup. But it in some way resolved the value aggregation issue. So the idea was to use OCR and convert PDF tables to CSV sheets.

But there’s one more problem, if you don’t scale down the table(in Google Sheets) and export, you’ll have value aggregation again, which seems like a nightmare to solve. So I scaled it down to 70-80% and then exported it in A3 Landscape sheet format with custom page breakups.

Now using this CSV sheet, I was able to populate my DB. There are issues still with consistency, looking for other alternatives. But this has to go with MVP. I want something to be live and get critical feedback on my work. Don’t wanna waste time figuring out an optimized approach and then release. I thought if I could set up the pipeline, I would be able to improve data quality later on.

I did the same; after 2 months of effort, the app is in the MVP stage. So I have planned to deploy it. I plan to improve things, like exporting the cutoff sheet as CSV and exposing CRUD APIs to allow development on top of existing KCET Cutoff Data.

Some notes for future development:

I hate to say this, but I don’t want to continue with Frappe. The alternative would be to use Flask or Django.
Add more cutoff data for engineering. Also try to scrape and dump Medical stream’s cutoff data

Note:

Right now, it is not open for open-source development. I have to update the development backend for debugging/testing. Planning to add that once the project gets some traction or users.