This study presents a novel approach to extracting parallel data from a comparable English-Punjabi corpus, addressing the scarcity of parallel corpora for this language pair. Unlike previous research, this approach focuses on creating high-precision parallel data using minimal resources. The data is sourced from diverse domains, including Wikipedia articles, TDIL's noisy parallel sentences, and Gyan Nidhi reports. The methodology consists of three phases: extracting and aligning documents, translating Punjabi texts into English using OpenNMT-py, and calculating content similarity through three measures: Euclidean distance, cosine similarity, and Jaccard similarity. These measures are run individually, and their results are then integrated to improve accuracy. By combining the scores of all three measures, the system achieves a precision of 93% and an accuracy of 86%. This integrated approach significantly enhances parallel data extraction for English-Punjabi corpora and holds potential for improving Statistical Machine Translation (SMT) models.
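The similarity-scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words representation, the mapping of Euclidean distance to a similarity via 1/(1+d), and the equal-weight averaging of the three scores are all assumptions made here for the sake of a runnable example.

```python
import math
from collections import Counter

def similarity_scores(sent_a: str, sent_b: str) -> dict:
    """Score two sentences with the three measures named in the paper
    (Euclidean, cosine, Jaccard) and combine them.

    Bag-of-words vectors and equal-weight averaging are assumptions;
    the paper does not specify these details.
    """
    tok_a, tok_b = sent_a.lower().split(), sent_b.lower().split()
    vec_a, vec_b = Counter(tok_a), Counter(tok_b)
    vocab = set(vec_a) | set(vec_b)

    # Cosine similarity over term-frequency vectors.
    dot = sum(vec_a[w] * vec_b[w] for w in vocab)
    norm = (math.sqrt(sum(v * v for v in vec_a.values()))
            * math.sqrt(sum(v * v for v in vec_b.values())))
    cosine = dot / norm if norm else 0.0

    # Jaccard similarity over token sets.
    union = set(tok_a) | set(tok_b)
    jaccard = len(set(tok_a) & set(tok_b)) / len(union) if union else 0.0

    # Euclidean distance, mapped into (0, 1] so it can be averaged
    # with the other two similarities (an assumed transformation).
    dist = math.sqrt(sum((vec_a[w] - vec_b[w]) ** 2 for w in vocab))
    euclidean = 1.0 / (1.0 + dist)

    combined = (cosine + jaccard + euclidean) / 3.0  # equal weights assumed
    return {"cosine": cosine, "jaccard": jaccard,
            "euclidean": euclidean, "combined": combined}
```

A sentence pair whose combined score exceeds a tuned threshold would be accepted as parallel; integrating the three scores this way filters out pairs that happen to score well on only one measure.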