Improvements/additions to Subtitle Line parsing #32

Merged
merged 6 commits into from
Jul 19, 2021
Changes from 4 commits
19 changes: 17 additions & 2 deletions README.md
@@ -1,12 +1,12 @@
## SubtitlesParser

Universal subtitles parser which aims at supporting all subtitle formats.
Universal subtitles parser which aims at supporting parsing for all subtitle formats, and writing some.
For more info on subtitles formats, see this page: http://en.wikipedia.org/wiki/Category:Subtitle_file_formats

It's available on Nuget:
> Install-Package SubtitlesParser

For now, 7 different formats are supported:
For now, 7 different formats are supported for parsing:
* MicroDvd https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Parsers/MicroDvdParser.cs
* SubRip https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Parsers/SrtParser.cs
* SubStationAlpha https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Parsers/SsaParser.cs
@@ -15,6 +15,9 @@ For now, 7 different formats are supported:
* WebVTT https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Parsers/VttParser.cs
* Youtube specific XML format https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Parsers/YtXmlFormatParser.cs

And 2 formats are supported for writing:
* SubRip https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Writers/SrtWriter.cs
* SubStationAlpha https://github.com/AlexPoint/SubtitlesParser/blob/master/SubtitlesParser/Classes/Writers/SsaWriter.cs

### Quickstart

@@ -42,3 +45,15 @@ using (var fileStream = File.OpenRead(pathToSrtFile)){
var items = parser.ParseStream(fileStream);
}
```

#### Specific writer

You can use a specific writer to write a List of SubtitleItems to a stream.
```csharp
var writer = new SubtitlesParser.Classes.Writers.SrtWriter();
using (var fileStream = File.OpenWrite(pathToSrtFile)) {
writer.WriteStream(fileStream, yourListOfSubtitleItems);
}
```

Async versions are also available (i.e. `writer.WriteStreamAsync(fileStream, yourListOfSubtitleItems);` instead).
26 changes: 14 additions & 12 deletions SubtitlesParser/Classes/Parsers/SrtParser.cs
@@ -19,17 +19,17 @@ namespace SubtitlesParser.Classes.Parsers
/// 00:00:15,000 --> 00:00:18,000
/// At the left we can see...[12]
/// </summary>
public class SrtParser: ISubtitlesParser
public class SrtParser : ISubtitlesParser
{

// Properties -----------------------------------------------------------------------

private readonly string[] _delimiters = { "-->" , "- >", "->" };
private readonly string[] _delimiters = {"-->", "- >", "->"};


// Constructors --------------------------------------------------------------------

public SrtParser(){}
public SrtParser() {}


// Methods -------------------------------------------------------------------------
@@ -40,8 +40,8 @@ public List<SubtitleItem> ParseStream(Stream srtStream, Encoding encoding)
if (!srtStream.CanRead || !srtStream.CanSeek)
{
var message = string.Format("Stream must be seekable and readable in a subtitles parser. " +
"Operation interrupted; isSeekable: {0} - isReadable: {1}",
srtStream.CanSeek, srtStream.CanSeek);
"Operation interrupted; isSeekable: {0} - isReadable: {1}",
srtStream.CanSeek, srtStream.CanRead);
throw new ArgumentException(message);
}

@@ -81,6 +81,8 @@ public List<SubtitleItem> ParseStream(Stream srtStream, Encoding encoding)
{
// we found the timecode, now we get the text
item.Lines.Add(line);
// strip formatting by removing anything within curly braces or angle brackets, which is how SRT styles text according to wikipedia (https://en.wikipedia.org/wiki/SubRip#Formatting)
item.PlaintextLines.Add(Regex.Replace(line, @"\{.*?\}|<.*?>", string.Empty));
}
}

@@ -105,7 +107,7 @@ public List<SubtitleItem> ParseStream(Stream srtStream, Encoding encoding)
throw new FormatException("Parsing as srt returned no srt part.");
}
}

/// <summary>
/// Enumerates the subtitle parts in a srt file based on the standard line break observed between them.
/// A srt subtitle part is in the form:
@@ -129,9 +131,9 @@ private IEnumerable<string> GetSrtSubTitleParts(TextReader reader)
// return only if not empty
var res = sb.ToString().TrimEnd();
if (!string.IsNullOrEmpty(res))
{
yield return res;
}
{
yield return res;
}
sb = new StringBuilder();
}
else
@@ -187,4 +189,4 @@ private static int ParseSrtTimecode(string s)
}

}
}
}
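
The plain-text stripping added to `SrtParser` above can be tried in isolation. The following is a minimal sketch, not the library's code: it reuses the same regular expression as the diff, while the class and method names (`SrtPlainTextSketch`, `StripFormatting`) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SrtPlainTextSketch
{
    // Same pattern as in the diff: lazily matches {...} or <...> and removes it.
    private static readonly Regex FormattingTags = new Regex(@"\{.*?\}|<.*?>");

    // Hypothetical helper name; the PR performs this replacement inline.
    public static string StripFormatting(string line) =>
        FormattingTags.Replace(line, string.Empty);

    public static void Main()
    {
        // Tags and override blocks are dropped, the text between them is kept.
        Console.WriteLine(StripFormatting("<i>At the left</i> we can {\\an8}see..."));

        // Caveat: literal text that happens to sit between '<' and '>' is dropped as well.
        Console.WriteLine(StripFormatting("if a < b and c > d"));
    }
}
```

The second call shows a limitation of the pattern: it cannot tell a formatting tag from ordinary text that contains angle brackets or braces.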
55 changes: 45 additions & 10 deletions SubtitlesParser/Classes/Parsers/SsaParser.cs
@@ -3,6 +3,7 @@
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using SubtitlesParser.Classes.Utils;

namespace SubtitlesParser.Classes.Parsers
@@ -43,8 +44,8 @@ public List<SubtitleItem> ParseStream(Stream ssaStream, Encoding encoding)
if (!ssaStream.CanRead || !ssaStream.CanSeek)
{
var message = string.Format("Stream must be seekable and readable in a subtitles parser. " +
"Operation interrupted; isSeekable: {0} - isReadable: {1}",
ssaStream.CanSeek, ssaStream.CanRead);
"Operation interrupted; isSeekable: {0} - isReadable: {1}",
ssaStream.CanSeek, ssaStream.CanRead);
throw new ArgumentException(message);
}

@@ -53,11 +54,22 @@ public List<SubtitleItem> ParseStream(Stream ssaStream, Encoding encoding)

var reader = new StreamReader(ssaStream, encoding, true);

// default wrap style to none if the header section doesn't contain a wrap style definition (very possible since it wasn't present in SSA, only ASS)
SsaWrapStyle wrapStyle = SsaWrapStyle.None;

var line = reader.ReadLine();
var lineNumber = 1;
// read the line until the [Events] section
while (line != null && line != SsaFormatConstants.EVENT_LINE)
{
if (line.StartsWith(SsaFormatConstants.WRAP_STYLE_PREFIX))
{
// get the wrap style
// the raw string is the second array item after splitting the line at `:` (which we know will be present since it's
// included in the `WRAP_STYLE_PREFIX` const), so trim the space off the beginning of that item, and parse that string into the enum
wrapStyle = line.Split(':')[1].TrimStart().FromString();
}

line = reader.ReadLine();
lineNumber++;
}
@@ -81,7 +93,7 @@ public List<SubtitleItem> ParseStream(Stream ssaStream, Encoding encoding)
line = reader.ReadLine();
while (line != null)
{
if(!string.IsNullOrEmpty(line))
if (!string.IsNullOrEmpty(line))
{
var columns = line.Split(SsaFormatConstants.SEPARATOR);
var startText = columns[startIndexColumn];
@@ -93,15 +105,38 @@ public List<SubtitleItem> ParseStream(Stream ssaStream, Encoding encoding)
var start = ParseSsaTimecode(startText);
var end = ParseSsaTimecode(endText);

// TODO: split text line?
if (start > 0 && end > 0 && !string.IsNullOrEmpty(textLine))
{
List<string> lines;
switch (wrapStyle)
{
case SsaWrapStyle.Smart:
case SsaWrapStyle.SmartWideLowerLine:
case SsaWrapStyle.EndOfLine:
// according to the spec doc:
// `\n` is ignored by SSA if smart-wrapping (and therefore smart with wider lower line) is enabled
// end-of-line word wrapping: only `\N` breaks
lines = textLine.Split(@"\N").ToList();
break;
case SsaWrapStyle.None:
// the default value of the variable is None, which breaks on either `\n` or `\N`

// according to the spec doc:
// no word wrapping: `\n` `\N` both breaks
lines = Regex.Split(textLine, @"(?:\\n)|(?:\\N)").ToList(); // regex because there isn't an overload to take an array of strings to split on
break;
default:
throw new ArgumentOutOfRangeException();
}

var item = new SubtitleItem()
{
StartTime = start,
EndTime = end,
Lines = new List<string>() { textLine }
};
{
StartTime = start,
EndTime = end,
Lines = lines,
// strip formatting by removing anything within curly braces, this will not remove duplicate content however, which can happen when working with signs for example
PlaintextLines = lines.Select(subtitleLine => Regex.Replace(subtitleLine, @"\{.*?\}", string.Empty)).ToList()
};
items.Add(item);
}
}
@@ -121,7 +156,7 @@ public List<SubtitleItem> ParseStream(Stream ssaStream, Encoding encoding)
{
var message = string.Format("Couldn't find all the necessary columns " +
"headers ({0}, {1}, {2}) in header line {3}",
SsaFormatConstants.START_COLUMN, SsaFormatConstants.END_COLUMN,
SsaFormatConstants.START_COLUMN, SsaFormatConstants.END_COLUMN,
SsaFormatConstants.TEXT_COLUMN, headerLine);
throw new ArgumentException(message);
}
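
The wrap-style-dependent line splitting in the hunk above can be sketched on its own as follows. This is an illustration under assumptions, not the PR's implementation: the `WrapStyle` enum is a local stand-in for `SsaWrapStyle`, `SplitLines` is a hypothetical helper, and the sketch uses the `String.Split(string[], StringSplitOptions)` overload rather than `Regex.Split`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SsaLineBreakSketch
{
    // Local stand-in for the PR's SsaWrapStyle enum (same numeric values).
    public enum WrapStyle { Smart = 0, EndOfLine = 1, None = 2, SmartWideLowerLine = 3 }

    // Hypothetical helper: splits the raw SSA text field into display lines.
    public static List<string> SplitLines(string text, WrapStyle style)
    {
        // Only wrap style 2 (None) treats the soft break \n as a line break;
        // every other style breaks on the hard break \N only.
        var separators = style == WrapStyle.None
            ? new[] { @"\n", @"\N" }
            : new[] { @"\N" };
        return text.Split(separators, StringSplitOptions.None).ToList();
    }

    public static void Main()
    {
        // Verbatim string: the backslashes are literal characters, as in an .ass file.
        var text = @"First\nsecond\Nthird";
        Console.WriteLine(string.Join(" | ", SplitLines(text, WrapStyle.Smart)));
        Console.WriteLine(string.Join(" | ", SplitLines(text, WrapStyle.None)));
    }
}
```

Unlike the `switch` in the diff, this sketch does not throw on an unknown enum value; it treats anything other than `None` as "break on `\N` only".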
9 changes: 9 additions & 0 deletions SubtitlesParser/Classes/SubtitleItem.cs
@@ -16,7 +16,16 @@ public class SubtitleItem
/// End time in milliseconds.
/// </summary>
public int EndTime { get; set; }
/// <summary>
/// The raw subtitle string from the file
/// May include formatting
/// </summary>
public List<string> Lines { get; set; }
/// <summary>
/// The plain-text string from the file
/// Does not include formatting
/// </summary>
public List<string> PlaintextLines { get; set; }


//Constructors-----------------------------------------------------------------
1 change: 1 addition & 0 deletions SubtitlesParser/Classes/Utils/SsaFormatConstants.cs
@@ -7,6 +7,7 @@ public static class SsaFormatConstants
public const char SEPARATOR = ',';
public const char COMMENT = ';';

public const string WRAP_STYLE_PREFIX = "WrapStyle: ";
public const string DIALOGUE_PREFIX = "Dialogue: ";

public const string START_COLUMN = "Start";
61 changes: 61 additions & 0 deletions SubtitlesParser/Classes/Utils/SsaWrapStyle.cs
@@ -0,0 +1,61 @@
namespace SubtitlesParser.Classes.Utils
{
/// <summary>
/// Represents a Wrap Style used by Advanced SSA
/// </summary>
// Note: the spec doc doesn't actually specify a name, just a number and a description, so I took some creative liberties
public enum SsaWrapStyle
{
/// <summary>
/// Smart wrapping, lines are evenly broken
/// </summary>
Smart = 0,
/// <summary>
/// End-of-line word wrapping, only \N breaks
/// </summary>
EndOfLine = 1,
/// <summary>
/// No word wrapping, \n \N both breaks
/// </summary>
None = 2,
/// <summary>
/// Same as Smart, but the lower line gets wider
/// </summary>
SmartWideLowerLine = 3
}

/// <summary>
/// Extension methods for parsing to a wrap style
/// </summary>
public static class SsaWrapStyleExtensions
{
/// <summary>
/// Parse a string into a wrap style
///
/// Invalid input strings will return `SsaWrapStyle.None`
/// </summary>
/// <param name="rawString">A string representation of a wrap style value</param>
/// <returns>A SsaWrapStyle corresponding to the value parsed from the input string</returns>
public static SsaWrapStyle FromString(this string rawString) =>
int.TryParse(rawString, out int rawInt) == false ?
SsaWrapStyle.None: // basically an arbitrary choice, could also throw an exception here instead
Contributor Author:

Just wanted to draw your attention to this, as I'm not sure what you'd prefer the library do when it fails to parse the value. If you'd rather have it throw an exception, I can change it.

Contributor Author:

I'm also looking at this again and realizing the `== false` is redundant since it's a ternary operator instead of the if statement I originally had...

Will leave it unless you want it changed, or if I go in there anyways to change to throwing an exception.

Owner:

Indeed, the two options have their pros and cons.
You can leave it like this for the moment. I can't think of cases where things could go wrong, but if I come across some in the future, I'll consider whether the other option (throwing an exception) is more suitable.
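
For reference, the two options discussed in this thread could look roughly like the sketch below. It is only an illustration: `WrapStyle` is a local stand-in for `SsaWrapStyle`, and `ParseOrDefault` / `ParseOrThrow` are hypothetical names, not methods of the library.

```csharp
using System;

public static class WrapStyleParsingSketch
{
    // Local stand-in for the PR's SsaWrapStyle enum (same numeric values).
    public enum WrapStyle { Smart = 0, EndOfLine = 1, None = 2, SmartWideLowerLine = 3 }

    // Lenient option: falls back to None, like the PR, without the `== false` comparison.
    public static WrapStyle ParseOrDefault(string raw) =>
        int.TryParse(raw, out int value) && Enum.IsDefined(typeof(WrapStyle), value)
            ? (WrapStyle)value
            : WrapStyle.None;

    // Strict option: throws when the input is not an integer or is outside 0 to 3.
    public static WrapStyle ParseOrThrow(string raw)
    {
        if (!int.TryParse(raw, out int value) || !Enum.IsDefined(typeof(WrapStyle), value))
        {
            throw new FormatException($"'{raw}' is not a valid wrap style (expected 0 to 3).");
        }
        return (WrapStyle)value;
    }

    public static void Main()
    {
        Console.WriteLine(ParseOrDefault("1"));     // EndOfLine
        Console.WriteLine(ParseOrDefault("bogus")); // None
    }
}
```

The lenient version keeps parsing tolerant of malformed headers, while the strict version surfaces them early; which is preferable depends on how callers are expected to handle bad files.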

FromInt(rawInt);

/// <summary>
/// Parse an integer into a wrap style
///
/// Integers outside the range of valid wrap styles will default to `SsaWrapStyle.None`
/// </summary>
/// <param name="rawInt">An integer inside the range of values representing a wrap style</param>
/// <returns>A SsaWrapStyle corresponding to the integer value specified</returns>
public static SsaWrapStyle FromInt(this int rawInt) =>
rawInt switch
{
0 => SsaWrapStyle.Smart,
1 => SsaWrapStyle.EndOfLine,
2 => SsaWrapStyle.None,
3 => SsaWrapStyle.SmartWideLowerLine,
_ => SsaWrapStyle.None
};
}
}
8 changes: 5 additions & 3 deletions SubtitlesParser/Classes/Writers/ISubtitlesWriter.cs
@@ -14,13 +14,15 @@ public interface ISubtitlesWriter
/// </summary>
/// <param name="stream">the stream to write to</param>
/// <param name="subtitleItems">the SubtitleItems to write</param>
void WriteStream(Stream stream, IEnumerable<SubtitleItem> subtitleItems);

/// <param name="includeFormatting">if formatting codes should be included when writing the subtitle item lines. Each subtitle item must have the PlaintextLines property set.</param>
void WriteStream(Stream stream, IEnumerable<SubtitleItem> subtitleItems, bool includeFormatting = true);

/// <summary>
/// Asynchronously writes a list of SubtitleItems into a stream
/// </summary>
/// <param name="stream">the stream to write to</param>
/// <param name="subtitleItems">the SubtitleItems to write</param>
Task WriteStreamAsync(Stream stream, IEnumerable<SubtitleItem> subtitleItems);
/// <param name="includeFormatting">if formatting codes should be included when writing the subtitle item lines. Each subtitle item must have the PlaintextLines property set.</param>
Task WriteStreamAsync(Stream stream, IEnumerable<SubtitleItem> subtitleItems, bool includeFormatting = true);
}
}